Challenges in NMT - 2004.05809
Machine translation (MT) is a technique that leverages computers to translate human languages automatically. Nowadays, neural machine translation (NMT), which models a direct mapping between source and target languages with deep neural networks, has achieved a big breakthrough in translation performance and has become the de facto paradigm of MT. This article reviews the NMT framework, discusses the challenges in NMT, introduces some exciting recent progress and finally looks forward to some potential future research trends. In addition, we maintain the state-of-the-art methods for various NMT tasks at https://github.com/ZNLP/SOTA-MT.
Keywords: neural machine translation, Transformer, multimodal translation, low-resource translation, document translation
Statistical machine translation (SMT) was the dominant approach until the introduction of deep learning into MT. Since 2014, neural machine translation (NMT) based on deep neural networks has developed quickly [8, 45, 116, 122]. In 2016, through extensive experiments on various language pairs, [65, 143] demonstrated that NMT had made a big breakthrough, obtained remarkable improvements over SMT and even approached human-level translation quality [55]. This article attempts to give a review of the NMT framework, discusses some challenging research tasks in NMT, introduces some exciting progress and forecasts several future research topics.

The remainder of this article is organized as follows: Sec. 2 first introduces the background and the state-of-the-art paradigm of NMT. In Sec. 3 we discuss the key challenging research tasks in NMT. From Sec. 4 to Sec. 7, the recent progress concerning each challenge is presented. Sec. 8 discusses how current NMT compares with human translation and looks forward to several future research tasks.

Figure 1 encoder-decoder framework for neural machine translation. The encoder encodes the input sequence x0 x1 x2 x3 x4 x5 into distributed semantic representations, based on which the decoder produces the output sequence y0 y1 y2 y3 y4.
Neural machine translation is an end-to-end model following an encoder-decoder framework that usually includes two neural networks for the encoder and the decoder respectively [8, 45, 116, 122]. As shown in Fig. 1, the encoder network first maps each input token of the source-language sentence into a low-dimensional real-valued vector (aka word embedding) and then encodes the sequence of vectors into distributed semantic representations, from which the decoder network generates the target-language sentence token by token 1) from left to right.

From the probabilistic perspective, NMT models the conditional probability of the target-language sentence y = y_0, ..., y_i, ..., y_I given the source-language sentence x = x_0, ..., x_j, ..., x_J as a product of token-level translation probabilities:

P(y \mid x, \theta) = \prod_{i=0}^{I} P(y_i \mid x, y_{<i}, \theta) \qquad (1)

where y_{<i} = y_0, ..., y_{i-1} is the partial translation that has been generated so far. x_0, y_0 and x_J, y_I are often special symbols <s> and </s> indicating the start and end of a sentence respectively.

The token-level translation probability can be defined as follows:

P(y_i \mid x, y_{<i}, \theta) = \frac{\exp\big(g(y_i, x, y_{<i}, \theta)\big)}{\sum_{y' \in V} \exp\big(g(y', x, y_{<i}, \theta)\big)} \qquad (2)

in which V denotes the vocabulary of the target language and g(·) is a non-linear function that calculates a real-valued score for the prediction y_i conditioned on the input x, the partial translation y_{<i} and the model parameters θ. The non-linear function g(·) is realized through the encoder and decoder networks. The input sentence x is abstracted into hidden semantic representations h through multiple encoder layers. y_{<i} is summarized into the target-side history context representation z by the decoder network, which further combines h and z using an attention mechanism to predict the score of y_i.

The network parameters θ can be optimized to maximize the log-likelihood over the bilingual training data D = {(x^{(m)}, y^{(m)})}_{m=1}^{M}:

\hat{\theta} = \arg\max_{\theta^{*}} \sum_{m=1}^{M} \log P(y^{(m)} \mid x^{(m)}, \theta^{*}) \qquad (3)

Recent years have witnessed the fast development of the encoder-decoder networks, from the recurrent neural network [8, 116], to the convolutional neural network [45] and then to the self-attention based network Transformer [122]. At present, Transformer is the state of the art in terms of both quality and efficiency.
1) Currently, subword is the most popular translation token for NMT [109].
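As a minimal illustration (our own sketch in PyTorch, not the authors' code), Eqs. (1)-(3) amount to normalizing the scores g(·) with a softmax and summing the gold tokens' log-probabilities:

```python
import torch
import torch.nn.functional as F

def sentence_log_likelihood(scores: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """scores: [I+1, |V|] real-valued scores g(.) for every target position,
    target: [I+1] gold token ids y_0 .. y_I (y_I being </s>).
    Returns log P(y|x, theta) = sum_i log P(y_i | x, y_<i, theta), cf. Eqs. (1)-(2)."""
    log_probs = F.log_softmax(scores, dim=-1)            # normalize the scores over the vocabulary V
    token_ll = log_probs.gather(1, target.unsqueeze(1))  # pick log P(y_i | x, y_<i) of the gold tokens
    return token_ll.sum()                                # product of probabilities -> sum of log terms

# Training (Eq. (3)) maximizes the sum of these sentence log-likelihoods over D, i.e. it
# minimizes  -sum_m sentence_log_likelihood(model(x_m, y_m), y_m)  by gradient descent.
```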
2.2 Transformer

Figure 2 the Transformer architecture. The encoder stacks N identical layers, each containing a multi-head intra-attention (self-attention) sub-layer and a feed-forward sub-layer; each of the N decoder layers additionally contains a masked multi-head intra-attention sub-layer and a multi-head inter-attention sub-layer over the encoder outputs, and the decoder output passes through a linear layer and a softmax to produce the output probabilities.

Residual connections and layer normalization (Add&Norm) are performed for each sub-layer in both the encoder and the decoder. It is easy to notice that the attention mechanism is the key component. There are three kinds of attention mechanisms, including encoder self-attention, decoder masked self-attention and encoder-decoder attention. They can be formalized with the same formula:

\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{q K^{T}}{\sqrt{d_k}}\right) V \qquad (4)

where q, K and V stand for a query, the key list and the value list respectively, and d_k is the dimension of the keys.

For the encoder self-attention, the queries, keys and values come from the same layer. For example, when calculating the output of the first encoder layer at the j-th position, let x_j be the sum of the input token embedding and the positional embedding. The query is the vector x_j, and the keys and values are the same, namely the embedding matrix x = [x_0 ... x_J]. Multi-head attention is further proposed to calculate attentions in different subspaces.
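The scaled dot-product attention of Eq. (4) can be sketched in a few lines; this is a simplified single-query illustration, not the full batched implementation of [122]:

```python
import math
import torch

def attention(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """q: [d_k] one query vector; K: [n, d_k] key list; V: [n, d_v] value list (Eq. (4))."""
    d_k = K.size(-1)
    scores = K @ q / math.sqrt(d_k)           # q K^T scaled by sqrt(d_k): one score per position
    weights = torch.softmax(scores, dim=-1)   # attention distribution over the n positions
    return weights @ V                        # weighted sum of the value vectors

# Multi-head attention runs this h times on different learned projections of q, K and V
# (one subspace per head) and concatenates the h outputs.
```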
3 Key challenging research tasks

Although Transformer has significantly advanced the development of neural machine translation, many challenges remain to be addressed. Obviously, designing a better NMT framework must be the most important challenge. However, since the introduction of Transformer, almost no more effective NMT architecture has been proposed. [22] presented an alternative encoder-decoder framework, RNMT+, which combines the merits of RNN-based and Transformer-based models to perform translation. [129, 150] investigated how to design much deeper Transformer models and [78] presented a Reformer model enabling rich interaction between encoder and decoder. [141] attempted to replace self-attention with dynamic convolutions. [112] proposed the evolved Transformer using neural architecture search. [83] aimed to improve Transformer from the perspective of multi-particle dynamic systems. Note that these models do not introduce big changes to the NMT architecture; the pursuit of novel and more effective NMT frameworks still has a long way to go.

In this section, we analyze and discuss the key challenges 3) facing NMT starting from its formulation. From the introduction in Sec. 2.1, NMT is formally defined as a sequence-to-sequence prediction task in which four assumptions are implicitly made. First, the input is a sentence rather than a paragraph or a document. Second, the output sequence is generated in a left-to-right autoregressive manner. Third, the NMT model is optimized over bilingual training data which should include large-scale parallel sentences in order to learn good network parameters. Fourth, the processing objects of NMT are pure texts (tokens, words and sentences) instead of speech and videos. Accordingly, four key challenges can be summarized as follows:

1. Document neural machine translation. In the NMT formulation, the sentence is the basic input for modeling. However, some words in a sentence are ambiguous and their sense can only be disambiguated with the context of surrounding sentences or paragraphs. When translating a document, we also need to guarantee that the same terms in different sentences lead to the same translation, and performing translation sentence by sentence independently cannot achieve this goal. Moreover, many discourse phenomena, such as coreference, omissions and coherence, cannot be handled in the absence of document-level information. Obviously, how to take full advantage of contexts beyond the sentence is a big challenge for neural machine translation.

2. Non-autoregressive decoding and bidirectional inference. Left-to-right decoding token by token follows an autoregressive style which seems natural and is in line with human reading and writing. It is also easy for training and inference. However, it has several drawbacks. On one hand, the decoding efficiency is quite limited since the i-th translation token can be predicted only after all the previous i − 1 predictions have been generated. On the other hand, predicting the i-th token can only access the previous history predictions and cannot utilize the future context information in an autoregressive manner, leading to inferior translation quality. Thus, how to break the autoregressive inference constraint is a challenge. Non-autoregressive decoding and bidirectional inference are two solutions from the perspectives of efficiency and quality respectively.

3. Low-resource translation. There are thousands of human languages in the world and abundant bitexts are only available for a handful of language pairs such as English-German, English-French and English-Chinese. Even in resource-rich language pairs, the parallel data are unbalanced since most of the bitexts exist mainly in several domains (e.g. news and patents). That is to say, the lack of parallel training corpora is very common in most languages and domains. It is well known that neural network parameters can be well optimized on highly repeated events (frequent word/phrase translation pairs in the training data for NMT), and the standard NMT model will be poorly learned on low-resource language pairs. As a result, how to make full use of the parallel data in other languages (pivot-based translation and multilingual translation) and how to take full advantage of non-parallel data (semi-supervised translation and unsupervised translation) are two challenges facing NMT.

4. Multimodal neural machine translation. Intuitively, human language is not only about texts, and understanding the meaning of a language may need the help of other modalities such as speech, images and videos. Consider the well-known example of determining the meaning of the word bank when translating the sentence "he went to the bank": it will be correctly translated if we are given an image in which a man is approaching a river. Furthermore, in many scenarios we are required to translate a speech or a video; for example, simultaneous speech translation is more and more in demand at various conferences and international live events. Therefore, how to perform multimodal translation under the encoder-decoder architecture is a big challenge for NMT. How to make full use of different modalities in multimodal translation and how to balance quality and latency in simultaneous speech translation are two specific challenges.

In the following sections, we briefly introduce the recent progress for each challenge.
3) [69, 80, 156] have also discussed various challenges.
Figure 3 illustration of two docNMT models. The left part shows the cascaded attention model proposed by [153], in which the previous source sentences are first leveraged to enhance the representation of the current source sentence and then used again in the decoder. The right part illustrates the two-pass docNMT model proposed by [146], in which a sentence-level NMT first generates a preliminary translation for each sentence and then the first-pass translations together with the source-side sentences are employed to generate the final translation results.
4 Document-level neural machine translation

As we discussed in Sec. 3, performing translation sentence by sentence independently would introduce several risks. An ambiguous word may not be correctly translated without the necessary information in the surrounding contextual sentences. The same term in different sentences of the same document may result in inconsistent translations. Furthermore, many discourse phenomena, such as coreference, omissions and cross-sentence relations, cannot be well handled. In a word, sentence-level translation will harm the coherence and cohesion of the translated documents if we ignore the discourse connections and relations between sentences.

In general, document-level machine translation (docMT) aims at exploiting useful document-level information (multiple sentences around the current sentence or the whole document) to improve the translation quality of the current sentence as well as the coherence and cohesion of the translated document. docMT has already been extensively studied in the era of statistical machine translation (SMT), in which most researchers mainly proposed explicit models to address some specific discourse phenomena, such as lexical cohesion and consistency [46, 144, 145], coherence [15] and coreference [104]. Due to the complicated integration of multiple components in SMT, these methods modeling discourse phenomena did not lead to promising improvements.

The NMT model, dealing with semantics and translation in a distributed vector space, facilitates the use of wider and deeper document-level information under the encoder-decoder framework. It does not need to explicitly model specific discourse phenomena as in SMT. According to the types of document information used, document-level neural machine translation (docNMT) roughly falls into three categories: dynamic translation memory [71, 120], surrounding sentences [61, 90, 125, 128, 146, 148, 153] and the whole document [87, 88, 117].

[120] presented a dynamic cache-like memory to maintain the hidden representations of previously translated words. The memory contains a fixed number of cells and each cell is a triple (c_t, s_t, y_t), where y_t is the prediction at the t-th step, c_t is the source-side context representation calculated by the attention model and s_t is the corresponding decoder state. During inference, when making the i-th prediction for a test sentence, c_i is first obtained through the attention model and the probability p(c_t|c_i) is computed based on their similarity. Then the memory context representation m_i is calculated by linearly combining all the values s_t with p(c_t|c_i). This cache-like memory can encourage words in similar contexts to share similar translations so that cohesion can be enhanced to some extent.

The biggest difference between using the whole document and using surrounding sentences lies in the number of sentences employed as the context. This article mainly introduces the methods exploiting surrounding sentences for docNMT. Relevant experiments further show that subsequent sentences on the right contribute little to the translation quality of the current sentence. Thus, most of the recent work aims at fully exploring the previous sentences to enhance docNMT. These methods can be divided into two categories. One just utilizes the previous source-side sentences [61, 119, 128, 153]. The other uses the previous source sentences as well as their target translations [90, 146].

If only previous source-side sentences are leveraged, the previous sentences can be concatenated with the current sentence as the input to the NMT model [119], or they can be encoded into a summarized source-side context with a hierarchical neural network [128]. [153] presented a cascaded attention model to make full use of the previous source sentences. As shown in the left part of Fig. 3, the previous sentences are first encoded as the document-level context representation. When encoding the current sentence, each word attends to the document-level context and obtains a context-enhanced source representation. During the calculation of cross-language attention in the decoder, the current source sentence together with the document-level context are both leveraged to predict the target word. The probability of the translation given the current sentence and the previous context sentences is formulated as follows:

P(y \mid x, doc_x; \theta) = \prod_{i=0}^{I} P(y_i \mid y_{<i}, x, doc_x; \theta) \qquad (7)

where doc_x denotes the source-side document-level context, namely the previous sentences.

If both the previous source sentences and their translations are employed, two-pass decoding is more suitable for the docNMT model [146]. As illustrated in the right part of Fig. 3, the sentence-level NMT model generates preliminary translations for each sentence in the first-pass decoding. Then, the second-pass model produces the final translations with the help of the source sentences and their preliminary translation results. The probability of the target sentence in the second pass can be written as:

P(y \mid x, doc_x, doc_y; \theta) = \prod_{i=0}^{I} P(y_i \mid y_{<i}, x, doc_x, doc_y; \theta) \qquad (8)

in which doc_y denotes the first-pass translations of doc_x.
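The two-pass procedure can be sketched as follows; `translate_sentence` and `translate_with_context` are hypothetical stand-ins for the sentence-level first-pass model and the context-aware second-pass model of [146], not an actual API:

```python
from typing import Callable, List

def two_pass_doc_translate(
    doc_src: List[str],
    translate_sentence: Callable[[str], str],
    translate_with_context: Callable[[str, List[str], List[str]], str],
) -> List[str]:
    # First pass: a sentence-level NMT model produces a preliminary translation per sentence.
    first_pass = [translate_sentence(x) for x in doc_src]
    # Second pass: re-translate each sentence with the previous source sentences (doc_x)
    # and their first-pass translations (doc_y) as additional context, cf. Eq. (8).
    final = []
    for i, x in enumerate(doc_src):
        doc_x, doc_y = doc_src[:i], first_pass[:i]
        final.append(translate_with_context(x, doc_x, doc_y))
    return final
```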
Since most methods for docNMT are designed to boost the overall translation quality (e.g. the BLEU score), it remains a big question whether these methods indeed handle the discourse phenomena well. To address this issue, [10] conducted an empirical investigation of docNMT models on the performance of processing various discourse phenomena, such as coreference, cohesion and coherence. Their findings indicate that the multi-encoder model exploring only the source-side previous sentences performs poorly in handling the discourse phenomena, while exploiting both source sentences and target translations leads to the best performance. Accordingly, [123, 124] recently focused on designing better document-level NMT to improve on specific discourse phenomena such as deixis, ellipsis and lexical cohesion for English-Russian translation.

5 Non-autoregressive decoding and bidirectional inference

Most NMT models follow the autoregressive generation style which produces the output word by word from left to right. Just as Sec. 3 discussed, this paradigm has to wait for i − 1 time steps before starting to predict the i-th target word. Furthermore, left-to-right autoregressive decoding cannot exploit the target-side future context (future predictions after the i-th word). Recently, much research work attempts to break this decoding paradigm. The non-autoregressive Transformer (NAT) [49] is proposed to remarkably lower the latency by emitting all of the target words at the same time, and bidirectional inference [161, 172] is introduced to improve the translation quality by making full use of both history and future contexts.

5.1 Non-autoregressive decoding

The non-autoregressive Transformer (NAT) aims at producing the entire target output in parallel. Different from the autoregressive Transformer model (AT), which terminates decoding when emitting an end-of-sentence token </s>, NAT has to know how many target words should be generated before parallel decoding. Accordingly, NAT calculates the conditional probability of a translation y given the source sentence x as follows:

P_{NAT}(y \mid x; \theta) = P_L(I \mid x; \theta) \cdot \prod_{i=0}^{I} P(y_i \mid x; \theta) \qquad (9)

To determine the output length, [49] proposed to use a fertility model which predicts the number of target words that should be translated for each source word. We can perform word alignment on the bilingual training data to obtain the gold fertilities for each sentence pair; the fertility model can then be trained together with the translation model. For each source word x_j, suppose the predicted fertility is Φ(x_j); the output length will be I = \sum_{j=0}^{J} \Phi(x_j).

Another issue remains: AT lets the previously generated output y_{i−1} be the input at the next time step to predict the i-th target word, but NAT has no such input in the decoder network.
Figure 4 illustration of the autoregressive NMT model and various non-autoregressive NMT models. AT denotes the conventional autoregressive NMT paradigm in which the i-th prediction can fully utilize the partial translation of i − 1 words. NAT indicates the non-autoregressive NMT model that generates all the target words simultaneously. SAT is a variant of NAT which produces an n-gram at each step. NAT-EDI denotes the non-autoregressive NMT model with an enhanced decoder input, which is generated by retrieving the phrase table.
[49] found that translation quality is particularly poor if the decoder input is simply omitted in NAT. To address this, they resort to the fertility model again and copy each source word as many times as its fertility Φ(x_j) into the decoder input. The empirical experiments show that NAT can dramatically boost the decoding efficiency, with a 15× speedup compared to AT. However, NAT severely suffers from accuracy degradation.
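A minimal sketch of this copy trick (our own illustration, assuming the fertilities have already been predicted): the target length is I = Σ_j Φ(x_j) and each source embedding is repeated fertility-many times to form the decoder input.

```python
from typing import List
import torch

def nat_decoder_input(src_embeddings: torch.Tensor, fertilities: List[int]) -> torch.Tensor:
    """src_embeddings: [J+1, d] embeddings of x_0..x_J; fertilities: predicted Phi(x_j) per word.
    Returns a [I, d] decoder input with I = sum_j Phi(x_j); all I positions are decoded in parallel."""
    copies = [src_embeddings[j].repeat(phi, 1)     # copy x_j exactly Phi(x_j) times
              for j, phi in enumerate(fertilities) if phi > 0]
    return torch.cat(copies, dim=0)
```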
The low translation quality may be due to at least two critical issues of NAT. First, there is no dependency between target words, although word dependency is ubiquitous in natural language generation. Second, the decoder inputs are the copied source words, which lie in a different semantic space from the target words. Recently, to address the shortcomings of the original NAT model, several methods have been proposed to improve the translation quality of NAT while maintaining its efficiency [54, 75, 110, 127, 137, 139].

[127] proposed a semi-autoregressive Transformer model (SAT) to combine the merits of both AT and NAT. SAT keeps the autoregressive property globally but performs non-autoregressive generation locally. Just as shown in Fig. 4, SAT generates K successive target words at each time step in parallel. If K = 1, SAT is exactly AT; it becomes NAT if K = I. By choosing an appropriate K, the dependency relations between fragments are well modeled and the translation quality can be much improved with some loss of efficiency.

To mimic the decoder input of the AT model, [54] introduced a simple but effective method that employs a phrase table, the core component of SMT, to convert source words into target words. Specifically, they first greedily segment the source sentence into phrases with a maximum-match algorithm. Suppose the longest phrase in the phrase table contains K words. x_{0:K−1} is a phrase if it matches an entry in the phrase table; otherwise they iteratively check x_{0:K−2}, x_{0:K−3} and so on. If x_{0:h} is a phrase, they then start to check x_{h+1:h+K}. After segmentation, each source phrase is mapped into its target translation, and the translations are concatenated together as the new decoder input, as shown in Fig. 4. Due to proper modeling of the decoder input with a highly efficient strategy, translation quality is substantially improved while the decoding speed is even faster than the NAT baseline.
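The greedy maximum-match segmentation can be sketched as below; representing the phrase table as a plain dict from a source phrase to one target translation is our simplification of [54]:

```python
from typing import Dict, List, Tuple

def greedy_segment(src: List[str],
                   phrase_table: Dict[Tuple[str, ...], List[str]],
                   K: int) -> List[str]:
    """Greedy maximum matching: always try the longest phrase (at most K words) first.
    Unmatched single words are copied as-is. The concatenated target sides form the
    enhanced decoder input of the NAT model."""
    dec_input: List[str] = []
    i = 0
    while i < len(src):
        for length in range(min(K, len(src) - i), 0, -1):
            phrase = tuple(src[i:i + length])
            if length == 1 or phrase in phrase_table:
                dec_input.extend(phrase_table.get(phrase, list(phrase)))
                i += length
                break
    return dec_input
```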
5.2 Bidirectional inference

From the viewpoint of improving translation quality, the autoregressive model can be enhanced by exploring the future context on the right. In addition to predicting and estimating the future contexts with various models [151, 169, 171], researchers find that left-to-right (L2R) and right-to-left (R2L) autoregressive models can generate complementary translations [57, 79, 161, 172]. For example, in Chinese-to-English translation, experiments show that L2R generates better prefixes while R2L is good at producing suffixes. Intuitively, it is a promising direction to combine the merits of bidirectional inference and fully exploit both history and future contexts on the target side.

To this end, many researchers resort to bidirectional decoding to take advantage of both L2R and R2L inference. These methods mainly fall into four categories: (1) enhancing agreement between L2R and R2L predictions [79, 164]; (2) reranking with bidirectional decoding [79, 106, 107]; (3) asynchronous bidirectional decoding [115, 161]; and (4) synchronous bidirectional decoding [154, 172, 173].
Ideally, L2R decoding should generate the same translation as R2L decoding. Under this reasonable assumption, [79, 164] introduced agreement constraints or regularization between L2R and R2L predictions during training, so that L2R inference can be improved.

The reranking algorithm is widely used in machine translation, and the R2L model can provide an estimation score for the quality of an L2R translation from another parameter space [79, 106, 107]. Specifically, L2R first generates an n-best list of translations. The R2L model is then leveraged to force-decode each translation, leading to a new score. Finally, the best translation is selected according to the new scores.

[115, 161] proposed an asynchronous bidirectional decoding model (ASBD) which first obtains the R2L outputs and then optimizes the L2R inference model based on both the source input and the R2L outputs. Specifically, [161] first trained an R2L model with the bilingual training data. Then, the optimized R2L decoder translates the source input of each sentence pair and produces the outputs (hidden states), which serve as additional context for L2R prediction when optimizing the L2R inference model. Due to the explicit use of right-side future contexts, the ASBD model significantly improves the translation quality. But these approaches still suffer from two issues. On one hand, they have to train two separate NMT models for L2R and R2L inference respectively, and the two-pass decoding strategy greatly increases the latency. On the other hand, the two models cannot interact with each other during inference, which limits the potential of performance improvement.

Figure 5 illustration of the synchronous bidirectional inference model. The top demonstrates how the bidirectional contexts can be leveraged during inference. The bottom compares the beam search algorithm of the conventional NMT and the synchronous bidirectional NMT.

[172] proposed a synchronous bidirectional decoding model (SBD) that produces outputs using both L2R and R2L decoding simultaneously and interactively. Specifically, a new synchronous attention model is proposed to conduct interaction between L2R and R2L inference. The top part of Fig. 5 gives a simple illustration of the proposed synchronous bidirectional inference model. The dotted arrows in color on the target side are the core of the SBD model: L2R and R2L inference interact with each other in an implicit way illustrated by the dotted arrows. All the arrows indicate the information passing flow; solid arrows show the conventional history context dependence, while dotted arrows introduce the future context dependence on the other inference direction. For example, besides the past predictions (y^{l2r}_0, y^{l2r}_1), L2R inference can also utilize the future contexts (y^{r2l}_0, y^{r2l}_1) generated by the R2L inference when predicting y^{l2r}_2. The conditional probability of the translation can be written as follows:

P(y \mid x) =
\begin{cases}
\prod_{i=0}^{I} P(\overrightarrow{y}_i \mid \overrightarrow{y}_0 \cdots \overrightarrow{y}_{i-1}, x, \overleftarrow{y}_0 \cdots \overleftarrow{y}_{i-1}) & \text{if L2R} \\
\prod_{i=0}^{I'} P(\overleftarrow{y}_i \mid \overleftarrow{y}_0 \cdots \overleftarrow{y}_{i-1}, x, \overrightarrow{y}_0 \cdots \overrightarrow{y}_{i-1}) & \text{if R2L}
\end{cases} \qquad (10)

To accommodate L2R and R2L inference at the same time, they introduced a novel beam search algorithm. As shown in the bottom right of Fig. 5, at each time step during decoding, each half of the beam maintains the hypotheses from L2R and R2L decoding respectively, and each hypothesis is generated by leveraging the already predicted outputs from both directions. At last, the final translation is chosen from the L2R and R2L results according to their translation probabilities.
Thanks to the appropriate and rich interaction, the SBD model substantially boosts the translation quality while the decoding speed is only slowed down by about 10%.

[173] further noticed that L2R and R2L do not necessarily have to produce the entire translation sentence. They let L2R generate the left half of the translation and R2L produce the right half, and the two halves are then concatenated to form the final translation. Using proper training algorithms, they demonstrated through extensive experiments that both translation quality and decoding efficiency can be significantly improved compared to the baseline Transformer model.

6 Low-resource Translation

Most NMT models assume that enough bilingual training data is available, which is rarely the case in real life. For a low-resource language pair, a natural question is what kind of knowledge can be transferred to build a relatively good NMT system. This section discusses three kinds of methods. The first attempts to share translation knowledge from other resource-rich language pairs, in which pivot translation and multilingual translation are the two key techniques. Pivot translation assumes that for the low-resource pair A and B, there is a language C that has rich bitexts with A and B respectively [26, 142]. This section mainly discusses the technique of multilingual translation in this first category. The second kind of methods resorts to the semi-supervised approach, which takes full advantage of limited bilingual training data and abundant monolingual data. The last one leverages unsupervised algorithms that require monolingual data only.

6.1 Multilingual neural machine translation

Let us first have a quick recap of the NMT model based on the encoder-decoder framework. The encoder is responsible for mapping the source-language sentence into distributed semantic representations. The decoder converts the source-side distributed semantic representations into the target-language sentence. Apparently, the encoder and the decoder (excluding the cross-language attention component) are just single-language dependent. Intuitively, the same source language in different translation systems (e.g. Chinese-to-English and Chinese-to-Hindi) can share the same encoder, and the same target language can share the same decoder (e.g. Chinese-to-English and Hindi-to-English). Multilingual neural machine translation is a framework that aims at building a unified NMT model capable of translating multiple languages through parameter sharing and knowledge transfer.

[36] is the first to design a multi-task learning method which shares the same encoder for one-to-many translation (one source language to multiple target languages). [176] proposed to share the decoder for many-to-one translation (many source languages to one target language). [43, 44] proposed to share the attention mechanism for many-to-many translation (many source languages to many target languages). Despite improved performance for low-resource languages, these methods are required to design a specific encoder or decoder for each language, which hinders their scalability in dealing with many languages.

[64] went a step further and let all source languages share the same encoder and all target languages share the same decoder, successfully training a single encoder-decoder NMT model for multilingual translation. The biggest issue is that the decoder is unaware of which target language it should translate into at test time. To this end, [64] introduced a simple strategy and added a special token indicating the target language (e.g. 2en and 2zh) at the beginning of the source sentence. By doing this, low-resource languages have the biggest chance to share translation knowledge with other resource-rich languages. It also enables zero-shot translation as long as the two languages are employed as source and target in the multilingual NMT model. In addition, this unified multilingual NMT is very scalable and could ideally translate all languages in one model. However, experiments find that the output is sometimes a mix of multiple languages even when using a translation direction indicator. Furthermore, this paradigm enforces different source/target languages to share the same semantic space without considering the structural divergence among different languages. The consequence is that single-model based multilingual NMT yields inferior translation performance compared to individually trained bilingual counterparts. Most recent research work mainly focuses on designing better models to balance language-independent parameter sharing and language-sensitive module design.
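The target-language token trick of [64] described above amounts to a one-line preprocessing step; the tag format beyond the 2en/2zh examples given in the text is our assumption:

```python
def add_target_tag(src_tokens: list, tgt_lang: str) -> list:
    """Prepend a target-language tag such as '2en' or '2zh' to the source sentence."""
    return [f"2{tgt_lang}"] + src_tokens

# e.g. add_target_tag("Wir lieben Frieden".split(), "en")
# -> ['2en', 'Wir', 'lieben', 'Frieden']; the shared encoder-decoder then translates into English.
```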
[13] augmented the attention mechanism in the decoder with language-specific signals. [134] proposed to use language-sensitive positions and language-dependent hidden representations for one-to-many translation. [98] designed an algorithm to generate language-specific parameters. [118] designed a language clustering method and forced languages in the same cluster to share the parameters of the same semantic space. [135] attempted to generate two languages simultaneously and interactively by sharing encoder parameters. [136] proposed a compact and language-sensitive multilingual translation model which attempts to share most of the parameters while maintaining language discrimination.

As shown in Fig. 6, [136] designed four novel modules in the Transformer framework compared to single-model based multilingual NMT.
First, they introduced a representor to replace both the encoder and the decoder by sharing the weight parameters of the self-attention block, the feed-forward block and the normalization blocks (middle part in Fig. 6). It makes the multilingual NMT model as compact as possible and maximizes the knowledge sharing among different languages. The objective function over L language pairs becomes:

L(\theta) = \sum_{l=1}^{L} \sum_{m=1}^{M_l} \log P(y_l^{(m)} \mid x_l^{(m)}; \theta_{rep}, \theta_{att}) \qquad (11)

where \theta_{rep} and \theta_{att} denote the parameters of the representor and the attention mechanism respectively.

Language-sensitive attention is further introduced, choosing the attention parameters according to the specific translation task dynamically.

For the language-sensitive discriminator (top part in Fig. 6), they employed a neural model f_{dis} on the top layer of the representor h^{rep}_{top}, and this model outputs a language judgment score P_{lang}:

P_{lang} = \mathrm{softmax}(W_{dis} \cdot f_{dis}(h^{rep}_{top}) + b_{dis}) \qquad (12)

Combining the above four ideas, they showed through extensive experiments that the new method significantly improves multilingual NMT in one-to-many, many-to-many and zero-shot scenarios, outperforming bilingual counterparts in most cases. This indicates that low-resource language translation can greatly benefit from this kind of multilingual NMT, and so can zero-resource language translation.

Figure 7 illustration of two methods exploring monolingual data. Part (a), source-to-target: P(x' \mid x; \theta_{s2t}, \theta_{t2s}) = \sum_{y} P(y \mid x; \theta_{s2t}) P(x' \mid y; \theta_{t2s}). Part (b), target-to-source: with y = s2t(x) and x' = t2s(y), the reward is R(x', x, \theta_{s2t}, \theta_{t2s}) = BLEU(x, x') + LM(y) + LM(x'). If the parameters are trained to maximize the objective function of (a), it is the auto-encoder based method; if the reward of (b) is used, it is the dual learning method. Note that this figure only demonstrates the usage of source-side monolingual data for simplicity; the use of target-side monolingual data is symmetric.
6.2 Semi-supervised neural machine translation

One line of work exploits source-side monolingual data with multi-task learning, jointly training the translation task and a source sentence reordering task by sharing the same encoder.

Many researchers resort to using monolingual data of both sides in NMT at the same time [25, 56, 163, 170]. We summarize two such methods in Fig. 7: the auto-encoder based semi-supervised learning method [25] and the dual learning method [56]. For a source-side monolingual sentence x, [25] employed source-to-target translation as the encoder to generate a latent variable y and leveraged target-to-source translation as the decoder to reconstruct the input, leading to x'. They optimize the parameters by maximizing the reconstruction probability, as shown in Fig. 7(a). The target-side monolingual data is used in a symmetric way. Fig. 7(b) shows the objective function of the dual learning method. [56] treated source-to-target translation as the primal task and target-to-source translation as the dual task. Agent A sends, through the primal task, a translation of the source monolingual sentence to agent B. B is responsible for estimating the quality of the translation with a language model and the dual task. The rewards, including the similarity between the input x and the reconstruction x' and the two language model scores LM(y) and LM(x'), are employed to optimize the network parameters of both the source-to-target and target-to-source NMT models. Similarly, the target-side monolingual data is used in a symmetric way in dual learning.
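A sketch of the dual-learning reward of Fig. 7(b); `s2t`, `t2s`, `bleu`, `lm_tgt` and `lm_src` are hypothetical stand-ins for the two translation models, a sentence-level BLEU function and the two language models:

```python
from typing import Callable, List

def dual_learning_reward(x: List[str],
                         s2t: Callable[[List[str]], List[str]],
                         t2s: Callable[[List[str]], List[str]],
                         bleu: Callable[[List[str], List[str]], float],
                         lm_tgt: Callable[[List[str]], float],
                         lm_src: Callable[[List[str]], float]) -> float:
    y = s2t(x)            # primal task: translate the source-side monolingual sentence
    x_rec = t2s(y)        # dual task: translate it back to the source language
    # R(x', x) = BLEU(x, x') + LM(y) + LM(x'); the reward is used to update both NMT models
    # (e.g. with policy gradient), favoring reconstructions that are faithful and fluent.
    return bleu(x, x_rec) + lm_tgt(y) + lm_src(x_rec)
```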
[163] introduced an iterative back-translation algorithm to exploit both source and target monolingual data with an EM optimization method. [170] proposed a mirror-generative NMT model that explores the monolingual data by unifying the source-to-target NMT model, the target-to-source NMT model and two language models; they showed that better performance can be achieved compared to back-translation, iterative back-translation and dual learning.

6.3 Unsupervised neural machine translation

Unsupervised neural machine translation addresses a very challenging scenario in which we are required to build an NMT model using only massive source-side monolingual data D_x = \{x^{(l_x)}\}_{l_x=1}^{L_x} and target-side monolingual data D_y = \{y^{(l_y)}\}_{l_y=1}^{L_y}.

Unsupervised machine translation dates back to the era of SMT, in which a decipherment approach is employed to learn word translations from monolingual data [37, 96, 102], or bilingual phrase pairs are extracted and their probabilities estimated from monolingual data [67, 155].

Since [91] found that word embeddings from two languages can be mapped using some seed translation pairs, bilingual word embedding learning, or bilingual lexicon induction, has attracted more and more attention [4, 21, 31, 41, 158, 159]. [4] and [31] applied linear embedding mapping and adversarial training to learn word pair matching at the distribution level and achieve promising accuracy for similar languages.

Bilingual lexicon induction greatly motivates the study of unsupervised NMT at the sentence level, and two techniques, the denoising auto-encoder and back-translation, make unsupervised NMT possible. The key idea is to find a common latent space between the two languages. [5] and [72] both optimized the dual tasks of source-to-target and target-to-source translation. [5] employed a shared encoder to force the two languages into the same semantic space, together with two language-dependent decoders. In contrast, [72] let the two languages share the same encoder and decoder, relying on an identifier to indicate the specific language, similar to single-model based multilingual NMT [64]. The architecture and training objective functions are illustrated in Fig. 8.

The top of Fig. 8 shows the use of the denoising auto-encoder. The encoder encodes a noisy version of the input x into the hidden representation z_src, which is used to reconstruct the input with the decoder. The distance (auto-encoder loss L_auto) between the reconstruction x' and the input x should be as small as possible. To guarantee that the source and target languages share the same semantic space, an adversarial loss L_adv is introduced to fool the language identifier.
Figure 8 architecture of the unsupervised NMT model. The top shows the denoising auto-encoder that aims at reconstructing the same-language input. The bottom demonstrates back-translation, which attempts to reconstruct the input in the other language using back-translation (target-to-source) and forward translation (source-to-target). The auto-encoder loss L_auto, the translation loss L_trans and the language adversarial loss L_adv are used together to optimize the dual NMT models.
The bottom of Fig. 8 illustrates the use of back-translation. A target sentence y is first back-translated into x* using an old target-to-source NMT model (the one optimized in the previous iteration; the initial model is the word-by-word translation model based on bilingual word induction). Then, a noisy version of the translation x* is encoded into z_src, which is then translated back into the target sentence y'. The new NMT model (encoder and decoder) is optimized to minimize the translation loss L_trans, which is the distance between the translation y' and the original target input y. Similarly, an adversarial loss is employed in the encoder module. This process iterates until convergence of the algorithm. Finally, the encoder and decoder can be applied to perform the dual translation tasks.
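A hedged sketch of the two training signals of Fig. 8; `enc`, `dec_src`, `dec_tgt`, `old_t2s` and `add_noise` are hypothetical stand-ins for the shared encoder, the two decoders, the previous-iteration target-to-source model and the noise function, not an actual API:

```python
import torch.nn.functional as F

def unsup_losses(x_ids, y_ids, enc, dec_src, dec_tgt, old_t2s, add_noise):
    """x_ids/y_ids: token-id tensors of a source and a target monolingual sentence; the decoders
    are assumed to return per-token logits and old_t2s to return token ids."""
    # Denoising auto-encoder (top of Fig. 8): reconstruct x from a corrupted copy of itself.
    z_src = enc(add_noise(x_ids), lang="src")
    l_auto = F.cross_entropy(dec_src(z_src), x_ids)     # distance between reconstruction x' and x
    # Back-translation (bottom of Fig. 8): y -> x* with the previous-iteration reverse model,
    # then train the forward direction to recover y from the noisy x*.
    x_star = old_t2s(y_ids)
    z_bt = enc(add_noise(x_star), lang="src")
    l_trans = F.cross_entropy(dec_tgt(z_bt), y_ids)     # distance between y' and y
    # An adversarial term (omitted here) additionally fools a language identifier so that both
    # languages are encoded into the shared latent space.
    return l_auto + l_trans
```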
[147] argued that sharing some layers of the encoder and decoder while making others language-specific could improve the performance of unsupervised NMT. [6] further combined NMT and SMT to improve the unsupervised translation quality. Most recently, [30, 103, 113] resorted to pre-training techniques to enhance the unsupervised NMT model. For example, [30] proposed a cross-lingual language model pre-training method under the BERT framework [33]; two pre-trained cross-lingual language models are then employed as the encoder and decoder respectively to perform translation.

7 Multimodal neural machine translation

We know that humans communicate with each other in a multimodal environment in which we see, hear, smell and so on. Naturally, it is ideal to perform machine translation with the help of texts, speech and images. Unfortunately, video corpora containing parallel texts, speech and images for machine translation are not publicly available currently. Recently, IWSLT-2020 4) organized the first evaluation on video translation, in which annotated video data is only available for the validation and test sets.

Translation for paired image-text, offline speech-to-text translation and simultaneous translation have become increasingly popular in recent years.

7.1 Image-Text Translation

Given an image and its text description in the source language, the task of image-text translation aims at translating the description in the source language into the target language, where the translation process can be supported by information from the paired image. It is a task requiring the integration of natural language processing and computer vision. WMT 5) organized the first evaluation task on image-text translation (called multimodal translation there) in 2016 and also released the widely used dataset Multi30K, consisting of about 30K images, each of which has an English description and translations in German, French and Czech 6). Several effective models have been proposed since then. These methods mainly differ in the usage of the image information, and we mainly discuss four of them in this section.

[59] proposed to encode the image into one distributed vector representation or a sequence of vector representations using convolutional neural networks. Then, they padded the vector representations together with the sentence as the final input for the NMT model, which does not need to be modified for adaptation. The core idea is that they did not distinguish images from texts in the model design.
4) http://iwslt.org/doku.php?id=evaluation
5) https://www.statmt.org/wmt16/multimodal-task.html
6) https://github.com/multi30k/dataset
Figure 9 comparison between the doubly-attentive model and the image imagination model for image-text translation. In the doubly-attentive model, the image is encoded as an additional input feature, while in the image imagination model, the image is a decoded output from the source sentence.
[19] presented a doubly-attentive decoder for the image-text translation task. The major difference from [59] is that they design a textual encoder and a visual encoder respectively, and employ two separate attention models to balance the contribution of text and image during prediction at each time step.

Another line of work treats the image as an output rather than an input: the source sentence is encoded once and two decoders are trained jointly, one for translation and the other for image imagination, with the objective

L(\theta) = \sum_{m=1}^{M} \big[ \log P(y^{(m)} \mid x^{(m)}; \theta) + \log P(IM^{(m)} \mid x^{(m)}; \theta) \big] \qquad (14)

where IM^{(m)} denotes the image paired with the m-th sentence pair.
7.2 Speech Translation

In the conventional pipeline (cascaded) approach, an automatic speech recognition (ASR) system first transcribes the source speech and a machine translation system then translates the transcription; ASR and NMT are not coupled and can be optimized independently.

Nevertheless, the pipeline method has two disadvantages. On one hand, errors propagate through the pipeline and the ASR errors are difficult to make up for during translation. On the other hand, the efficiency is limited due to the two-phase process. [175] believed in early years that end-to-end speech translation would become possible with the development of memory, computational capacity and representation models. Deep learning based on distributed representations facilitates the end-to-end modeling of the speech translation (ST) task. [12] presented an end-to-end model without using any source-language transcriptions under an encoder-decoder framework. Different from the pipeline paradigm, the end-to-end model should be optimized on training data consisting of instances (source speech, target text). We list some of the recently used datasets in Table 1, including IWSLT, Augmented Librispeech, Fisher and Callhome, MuST-C and TED-Corpus. It is easy to find from the table that the training data for end-to-end ST is rather limited. Multi-task learning [2, 11, 140], knowledge distillation [63, 81] and pre-training [9, 126] are three main research directions of ST.

As an example of pre-training, the text sentence on the MT side can be converted into the same length as the CTC output sequence of the ASR model. By doing this, the ASR encoder and the MT encoder will be consistent in length and semantic representations. Therefore, the pre-trained encoder and attention of the MT model can be used in ST in addition to the ASR encoder and the MT decoder.

For knowledge distillation, a text MT model pre-trained on large-scale bilingual data acts as the teacher and the end-to-end ST model acts as the student. Given training triples (s, x, y) of source speech, source transcription and target translation, the standard objective of the ST model maximizes the log-likelihood of each target sentence:

L_{ST}(\theta) = - \sum_{(s,y) \in D} \log P(y \mid s; \theta) \qquad (15)

\log P(y \mid s, \theta) = \sum_{i=0}^{I} \sum_{v=1}^{|V|} \mathbb{I}(y_i = v) \log P(y_i \mid s, y_{<i}, \theta) \qquad (16)

where |V| denotes the vocabulary size of the target language and \mathbb{I}(y_i = v) is the indicator function which indicates whether the i-th output token y_i happens to be the ground truth.
Given the MT teacher model pre-trained on large-scale data, it can be used to force-decode the pair (x, y) from the triple (s, x, y) and obtain a probability distribution for each target word y_i: Q(y_i \mid x, y_{<i}; \theta_{MT}). Then, the knowledge distillation loss can be written as follows:

L_{KD}(\theta) = - \sum_{(s,x,y) \in D} \sum_{i=0}^{I} \sum_{v=1}^{|V|} Q(y_i = v \mid x, y_{<i}; \theta_{MT}) \log P(y_i = v \mid s, y_{<i}, \theta) \qquad (17)

The final ST model can be trained by optimizing both the log-likelihood loss L_{ST}(\theta) and the knowledge distillation loss L_{KD}(\theta).
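A sketch of combining L_ST and L_KD for the student model; the interpolation weight `alpha` and the tensor shapes are our assumptions rather than the authors' recipe:

```python
import torch
import torch.nn.functional as F

def st_kd_loss(student_logits: torch.Tensor, teacher_probs: torch.Tensor,
               target: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """student_logits: [I, |V|] scores of the ST student given the speech s;
    teacher_probs: [I, |V|] distributions Q(.|x, y_<i) from the MT teacher; target: [I] gold ids."""
    log_p = F.log_softmax(student_logits, dim=-1)
    nll = F.nll_loss(log_p, target, reduction="sum")   # L_ST, Eqs. (15)-(16)
    kd = -(teacher_probs * log_p).sum()                # L_KD, Eq. (17)
    return alpha * nll + (1.0 - alpha) * kd            # the interpolation weight is an assumption
```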
In order to fully explore the integration of ASR and ST, [82] further proposed an interactive model in which ASR and MT perform synchronous decoding. As shown in Fig. 11, the dynamic outputs of each model can be used as context to improve the predictions of the other model. Through interaction, the quality of both models can be significantly improved while keeping the efficiency as much as possible.

Figure 11 demonstration of the interactive model for both ASR and ST. Taking T = 2 as an example, the transcription "everything" of the ASR model can be helpful to predict the Chinese translation at T = 2. Likewise, the translation at time step T = 1 is also beneficial to generate the transcriptions of the ASR model in the future time steps.

7.3 Simultaneous Machine Translation

Simultaneous machine translation (SimMT) aims at translating concurrently with the source-language speaker speaking. It addresses the problem where we need to incrementally produce the translation while the source-language speech is being received. This technology is very helpful for live events and real-time video-call translation. Recently, Baidu and Facebook organized the first evaluation tasks on SimMT at ACL-2020 12) and IWSLT-2020 13) respectively.

Obviously, the methods of offline speech translation introduced in the previous section are not applicable in these scenarios, since the latency would be intolerable if translation began only after speakers complete their utterances. Thus, balancing latency and quality becomes the key challenge for a SimMT system. If it translates before the necessary information arrives, the quality will decrease; however, the delay will be unnecessarily long if it waits for too much source-language content.

[94, 95] proposed to directly perform simultaneous speech-to-text translation, in which the model is required to generate the target-language translation from the incrementally incoming foreign speech. In contrast, more research work focuses on simultaneous text-to-text translation, where the transcriptions are assumed to be correct [1, 3, 7, 32, 48, 51, 86, 105, 167, 168]. This article mainly introduces the latter methods.
12) https://autosimtrans.github.io/shared
13) http://iwslt.org/doku.php?id=simultaneous translation
Figure 12 illustration of three policies for simultaneous machine translation. The top part is the conventional sequence-to-sequence MT model which begins translation after seeing the whole source sentence. The middle one demonstrates the wait-k policy which waits for k words before translation. The bottom part shows an example of the adaptive policy that predicts an output token at each time step; if the output is a special token ⟨ε⟩, it indicates reading one more source word.
All of these methods address the same question of strategy (also known as policy): when to read an input word from the source language and when to write an output word in the target language, namely when to wait and when to translate.

In general, the policies can be categorized into two bins. One is fixed-latency policies [32, 86], such as the wait-k policy. The other is adaptive policies [3, 7, 51, 105, 167, 168].

The wait-k policy proposed by [86] is proved simple but effective. Just as shown in the middle part of Fig. 12, the wait-k policy starts to translate after reading the first k source words. Then, it alternates between generating a target-language word and reading a new source-language word, until it meets the end of the source sentence. Accordingly, the probability of a target word y_i is conditioned on the history predictions y_{<i} and the prefix of the source sentence x_{<i+k}: P(y_i \mid y_{<i}, x_{<i+k}; \theta). The probability of the whole target sentence becomes:

P(y \mid x; \theta) = \prod_{i=0}^{I} P(y_i \mid y_{<i}, x_{<i+k}; \theta) \qquad (18)

In contrast to the previous sequence-to-sequence NMT training paradigm, [86] designed a prefix-to-prefix training style to best explore the wait-k policy. If Transformer is employed as the basic architecture, the prefix-to-prefix training algorithm only needs a slight modification. The key difference from Transformer is that the prefix-to-prefix model conditions on the first i + k rather than all source words at each time step i. This can be easily accomplished by applying masked self-attention when encoding the source sentence; in that case, each source word is constrained to only attend to its predecessors, and the hidden semantic representation of the (i + k)-th position summarizes the semantics of the prefix x_{<i+k}.
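The wait-k read/write schedule can be sketched as a simple loop; `predict_next` is a hypothetical stand-in for the prefix-to-prefix model P(y_i | y_<i, x_<i+k), not an actual API:

```python
from typing import Callable, Iterator, List

def wait_k_translate(read_next: Iterator[str],
                     predict_next: Callable[[List[str], List[str]], str],
                     k: int, eos: str = "</s>") -> List[str]:
    src: List[str] = []
    tgt: List[str] = []
    for _ in range(k):                      # read the first k source words before writing anything
        src.append(next(read_next, eos))
    while True:
        y = predict_next(src, tgt)          # models P(y_i | y_<i, x_<i+k) as in Eq. (18)
        if y == eos:
            break
        tgt.append(y)
        nxt = next(read_next, None)         # then alternate: write one word, read one word
        if nxt is not None:
            src.append(nxt)
    return tgt
```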
However, the wait-k policy is a fixed-latency model and it is difficult to decide k for different sentences, domains and languages. Thus, an adaptive policy is more appealing. Early attempts at adaptive policies are based on the reinforcement learning (RL) method. For example, [51] presented a two-stage model that employs a pre-trained sentence-based NMT model as the base model. On top of the base model, read or translate actions determine whether to receive a new source word or output a target word. These actions are trained using the RL method while fixing the base NMT model.

Differently, [167] proposed an end-to-end SimMT model for adaptive policies. They first add a special delay token ⟨ε⟩ to the target-language vocabulary. As shown in the bottom part of Fig. 12, if the model predicts ⟨ε⟩, it needs to receive a new source word. To train the adaptive policy model, they design dynamic action oracles with aggressive and conservative bounds as the expert policy for imitation learning. Suppose the prefix pair is (s, t). Then, the dynamic action oracle can be defined as follows:

\Pi^{\star}_{x,y,\alpha,\beta}(s, t) =
\begin{cases}
\{\langle\varepsilon\rangle\} & \text{if } s \neq x \text{ and } |s| - |t| \leq \alpha \\
\{y_{|t|+1}\} & \text{if } t \neq y \text{ and } |s| - |t| \geq \beta \\
\{\langle\varepsilon\rangle, y_{|t|+1}\} & \text{otherwise}
\end{cases}

where α and β are hyper-parameters denoting the aggressive and conservative bounds respectively, and |s| − |t| calculates the distance between the two prefixes. That is to say, if the current target prefix t is no more than α words behind the source prefix s, we can read a new source word; if t is shorter than s by more than β words, we generate the next target prediction.
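The oracle above translates almost directly into code; this is a sketch, and the token name `<eps>` is our placeholder for ⟨ε⟩:

```python
from typing import List, Set

EPS = "<eps>"  # placeholder for the delay token

def dynamic_oracle(s: List[str], t: List[str], x: List[str], y: List[str],
                   alpha: int, beta: int) -> Set[str]:
    """s, t: current source and target prefixes; x, y: full source sentence and reference."""
    next_word = y[len(t)] if len(t) < len(y) else None
    if s != x and len(s) - len(t) <= alpha:
        return {EPS}                         # aggressive bound: reading another source word is safe
    if next_word is not None and len(s) - len(t) >= beta:
        return {next_word}                   # conservative bound: the next target word must be written
    return {EPS} if next_word is None else {EPS, next_word}
```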
on sentence group is a worthy research topic which models
many-to-many translation. In addition, document-level eval-
8 Discussion and Future Research Tasks
uation is as important as the document-level MT methods,
8.1 NMT vs. Human and it serves as a booster of MT technology. [55] argued that
MT can achieve human parity in Chinese-to-English trans-
We can see from Sec.4-7 that great progresses have been lation on specific news tests if evaluating sentence by sen-
achieved in neural machine translation. Naturally, we may tence. However, as we discussed in the previous section that
wonder whether current strong NMT systems could perform [73, 74] demonstrated a stronger preference for human over
on par with or better than human translators. Exciting news MT when evaluating on document-level rather than sentence-
were reported in 2018 by [55] that they achieved human- level. Therefore, how to automatically evaluate the quality of
machine parity on Chinese-to-English news translation and document translation besides BLEU [97] is an open question
they found no significant difference of human ratings between although some researchers introduce several test sets to in-
their MT outputs and professional human translations. More- vestigate some specific discourse phenomena [92].
over, the best English-to-Czech system submitted to WMT 2. Efficient NMT inference
2018 by [99] was also found to perform significantly better
People prefer the model with both high accuracy and ef-
than the human-generated reference translations [14]. It is
ficiency. Despite of remarkable speedup, the quality degra-
encouraging that NMT can achieve very good translations in
dation caused by non-atuoregressive NMT is intolerable in
some specific scenarios and it seems that NMT has achieved
most cases. Improving the fertility model, word ordering of
the human-level translation quality.
decoder input and dependency of the output will be worthy of
However, we cannot be too optimistic since the MT tech-
a good study to make NAT close to AT in translation quality.
nology is far from satisfactory. On one hand, the comparisons
Synchronous bidirectional decoding deserves deeper investi-
were conducted only on news domain in specific language
gation due to good modeling of history and future contexts.
pairs where massive parallel corpora are available. In prac-
Moreover, several researchers start to design decoding algo-
tice, NMT performs quite poorly in many domains and lan-
rithm with free order [40, 50, 114] and it may be a good way
guage pairs, especially for the low-resource scenarios such
to study the nature of human language generation.
as Chinese-Hindi translation. On the other hand, the eval-
uation methods on the assessment of human-machine parity 3. Making full use of multilinguality and monolingual
conducted by [55] should be much improved as pointed out data
by [73]. According to the comprehensive investigations con- Low-resource translation is always a hot research topic
ducted by [73], human translations are much preferred over since most of natural languages are lack of abundant anno-
MT outputs if using better rating techniques, such as choos- tated bilingual data. The potential of multilingual NMT is
ing professional translators as raters, evaluating documents not fully explored and some questions remain open. For ex-
rather than individual sentences and utilizing original source ample, how to deal with data unbalance problem which is
texts instead of source texts translated from target language. very common in multilingual NMT? How to build a good
Current NMT systems still suffer from serious translation er- incremental multilingual NMT model for incoming new lan-
rors of mistranslated words or named entities, omissions and guages? Semi-supervised NMT is more practical in real ap-
wrong word order. Obviously, there is much room for NMT plications but the effective back-translation algorithm is very
to improve and we suggest some potential research directions time consuming. It deserves to design a much efficient semi-
in the next section. supervised NMT model for easy deployment. Deeply inte-
grating pre-trained method with NMT may lead to promising
improvement in the semi-supervised or unsupervised learn-
8.2 Future Research Tasks

In this section, we discuss some potential research directions for neural machine translation.

1. Effective document-level translation and evaluation

It is well known that document translation is important, yet current research results are not satisfactory. It remains unclear what scope of document context is best for translating a sentence, and it is still an open question whether it is reasonable to accomplish document translation by translating the sentences one by one from first to last. Translation based on sentence groups, which models many-to-many translation, may be a worthy research topic. In addition, document-level evaluation is as important as document-level MT methods, and it serves as a booster of MT technology. [55] argued that MT can achieve human parity in Chinese-to-English translation on specific news test sets when evaluated sentence by sentence. However, as discussed in the previous section, [73, 74] demonstrated a stronger preference for human translation over MT when evaluating at the document level rather than the sentence level. Therefore, how to automatically evaluate the quality of document translation beyond BLEU [97] remains an open question, although some researchers have introduced test sets to investigate specific discourse phenomena [92].
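As a concrete (hypothetical) illustration of such targeted evaluation, the sketch below scores each contrastive example by comparing a correct translation against a minimally different incorrect one (e.g. a wrong anaphoric pronoun) and reports how often a model prefers the correct variant; the score function is an assumed stand-in for the model's conditional log-probability, not an API of any particular toolkit.

```python
# Hedged sketch of contrastive evaluation for discourse phenomena, in the
# spirit of the targeted test sets mentioned above.
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str, str, str]   # (context, source, correct_tgt, wrong_tgt)

def contrastive_accuracy(examples: Iterable[Example],
                         score: Callable[[str, str, str], float]) -> float:
    """Fraction of examples where the model scores the correct variant higher."""
    wins = total = 0
    for context, source, good, bad in examples:
        wins += score(context, source, good) > score(context, source, bad)
        total += 1
    return wins / max(total, 1)

if __name__ == "__main__":
    # Toy scorer that only checks pronoun agreement with the document context.
    toy = lambda ctx, src, tgt: float(("she" in ctx) == ("she" in tgt))
    data = [("Mary came in. she was tired.", "<source sentence>",
             "she sat down.", "he sat down.")]
    print(contrastive_accuracy(data, toy))   # 1.0
```

Unlike sentence-level BLEU, this kind of accuracy directly measures whether the system uses cross-sentence context for a specific phenomenon, which is why such test suites complement corpus-level metrics.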
2. Efficient NMT inference

People prefer models with both high accuracy and high efficiency. Despite remarkable speedups, the quality degradation caused by non-autoregressive translation (NAT) is intolerable in most cases. Improving the fertility model, the word order of the decoder input, and the dependencies among output tokens is worth further study in order to bring NAT close to autoregressive translation (AT) in quality. Synchronous bidirectional decoding deserves deeper investigation thanks to its good modeling of history and future contexts. Moreover, several researchers have started to design decoding algorithms with free generation order [40, 50, 114], which may be a good way to study the nature of human language generation.

3. Making full use of multilinguality and monolingual data

Low-resource translation remains a hot research topic since most natural languages lack abundant annotated bilingual data. The potential of multilingual NMT has not been fully explored and some questions remain open. For example, how should we deal with the data imbalance problem, which is very common in multilingual NMT? How can we build a good incremental multilingual NMT model for newly arriving languages? Semi-supervised NMT is more practical in real applications, but the effective back-translation algorithm is very time consuming, so it is worth designing a much more efficient semi-supervised NMT model for easy deployment. Deeply integrating pre-training methods with NMT may lead to promising improvements in the semi-supervised or unsupervised learning framework, and [62, 174] have already shown good results. The achievements of unsupervised MT on similar language pairs (e.g. English-German and English-French) make us very optimistic. However, [76] showed that unsupervised MT performs poorly on distant language pairs, obtaining no more than 10 BLEU points in most cases. Obviously, it is challenging to design better unsupervised MT models for distant language pairs.
14 Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of WMT, 2018.
15 Leo Born, Mohsen Mesgar, and Michael Strube. Using a graph-based coherence model in document-level machine translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 26–35, 2017.
16 Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
17 Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. Probing the need for visual context in multimodal machine translation. In Proceedings of NAACL, 2019.
18 Iacer Calixto and Qun Liu. An error analysis for image-based multi-modal neural machine translation. Machine Translation, 33(1-2):155–177, 2019.
19 Iacer Calixto, Qun Liu, and Nick Campbell. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of ACL, 2017.
20 Iacer Calixto, Miguel Rios, and Wilker Aziz. Latent variable model for multi-modal translation. In Proceedings of ACL, 2019.
21 Hailong Cao and Tiejun Zhao. Point set registration for unsupervised bilingual lexicon induction. In IJCAI, pages 3991–3997, 2018.
22 Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of ACL, pages 76–86, 2018.
23 Yong Cheng, Lu Jiang, and Wolfgang Macherey. Robust neural machine translation with doubly adversarial inputs. In Proceedings of ACL, 2019.
24 Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. Towards robust neural machine translation. In Proceedings of ACL, 2018.
25 Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Semi-supervised learning for neural machine translation. In Proceedings of ACL, 2016.
26 Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. Joint training for pivot-based neural machine translation. In Proceedings of IJCAI, 2017.
27 David Chiang. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL, pages 263–270. Association for Computational Linguistics, 2005.
28 Chenhui Chu, Raj Dabre, and Sadao Kurohashi. An empirical comparison of simple domain adaptation methods for neural machine translation. In Proceedings of ACL, 2017.
29 Chenhui Chu and Rui Wang. A survey of domain adaptation for neural machine translation. In Proceedings of COLING, 2018.
30 Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067, 2019.
31 Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. In Proceedings of ICLR, 2018.
32 Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of NAACL, 2018.
33 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
34 Mattia A Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. MuST-C: a multilingual speech translation corpus. In Proceedings of NAACL, 2019.
35 Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. Visualizing and understanding neural machine translation. In Proceedings of ACL, pages 1150–1159, 2017.
36 Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of ACL-IJCNLP, pages 1723–1732, 2015.
37 Qing Dou and Kevin Knight. Large scale decipherment for out-of-domain machine translation. In Proceedings of EMNLP-CoNLL, pages 266–275. Association for Computational Linguistics, 2012.
38 Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of EMNLP, 2018.
39 Desmond Elliott and Akos Kádár. Imagination improves multimodal translation. In Proceedings of IJCNLP, 2017.
40 Dmitrii Emelianenko, Elena Voita, and Pavel Serdyukov. Sequence modeling with unconstrained generation order. In Advances in Neural Information Processing Systems, pages 7698–7709, 2019.
41 Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, pages 462–471, 2014.
42 Yang Feng, Shiyue Zhang, Andi Zhang, Dong Wang, and Andrew Abel. Memory-augmented neural machine translation. In Proceedings of ACL, 2019.
43 Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of NAACL, 2016.
44 Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T Yarman Vural, and Yoshua Bengio. Multi-way, multilingual neural machine translation. Computer Speech & Language, 45:236–252, 2017.
45 Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In Proceedings of ICML, 2017.
46 Zhengxian Gong, Min Zhang, and Guodong Zhou. Cache-based document-level statistical machine translation. In Proceedings of EMNLP, pages 909–919. Association for Computational Linguistics, 2011.
47 Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of ICML, pages 369–376, 2006.
48 Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of EMNLP, pages 1342–1352, 2014.
49 Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. In Proceedings of ICLR, 2018.
50 Jiatao Gu, Qi Liu, and Kyunghyun Cho. Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics, 7:661–676, 2019.
51 Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor OK Li. Learning to translate in real-time with neural machine translation. In Proceedings of EACL, 2017.
52 Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535, 2015.
53 Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. On integrating a language model into neural machine translation. Computer Speech & Language, 45:137–148, 2017.
54 Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI, volume 33, pages 3723–
lation in neural machine translation. In Proceedings of WMT, 2018.
93 Makoto Nagao. A framework of a mechanical translation between Japanese and English by analogy principle. Artificial and Human Intelligence, pages 351–354, 1984.
94 Jan Niehues, Thai Son Nguyen, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Müller, Matthias Sperber, Sebastian Stüker, and Alex Waibel. Dynamic transcription for low-latency speech translation. In Interspeech, pages 2513–2517, 2016.
95 Jan Niehues, Ngoc-Quan Pham, Thanh-Le Ha, Matthias Sperber, and Alex Waibel. Low-latency neural speech translation. In Proceedings of Interspeech, 2018.
96 Malte Nuhn, Arne Mauser, and Hermann Ney. Deciphering foreign language by combining language models and context vectors. In Proceedings of ACL, pages 156–164. Association for Computational Linguistics, 2012.
97 Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318, 2002.
98 Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. Contextual parameter generation for universal neural machine translation. In Proceedings of EMNLP, 2018.
99 Martin Popel. CUNI transformer neural MT system for WMT18. In Proceedings of WMT, pages 482–487, 2018.
100 Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. In Proceedings of IWSLT, 2013.
101 Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of EACL, 2017.
102 Sujith Ravi and Kevin Knight. Deciphering foreign language. In Proceedings of ACL, pages 12–21, 2011.
103 Shuo Ren, Yu Wu, Shujie Liu, Ming Zhou, and Shuai Ma. Explicit cross-lingual pre-training for unsupervised machine translation. In Proceedings of EMNLP-IJCNLP, 2019.
104 Annette Rios and Don Tuggener. Co-reference resolution of elided subjects and possessive pronouns in Spanish-English statistical machine translation. In Proceedings of EACL, 2017.
105 Harsh Satija and Joelle Pineau. Simultaneous machine translation using deep reinforcement learning. In ICML 2016 Workshop on Abstraction in Reinforcement Learning, 2016.
106 Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. The University of Edinburgh's neural MT systems for WMT17. In Proceedings of WMT, 2017.
107 Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. In Proceedings of WMT, pages 371–376, 2016.
108 Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of ACL, 2016.
109 Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of ACL, 2016.
110 Chenze Shao, Yang Feng, Jinchao Zhang, Fandong Meng, Xilin Chen, and Jie Zhou. Retrieving sequential information for non-autoregressive neural machine translation. In Proceedings of ACL, 2019.
111 Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for neural machine translation. In Proceedings of ACL, 2016.
112 David R So, Chen Liang, and Quoc V Le. The evolved transformer. In Proceedings of ICML, 2019.
113 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of ICML, 2019.
114 Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. Insertion transformer: Flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249, 2019.
115 Jinsong Su, Xiangwen Zhang, Qian Lin, Yue Qin, Junfeng Yao, and Yang Liu. Exploiting reverse target-side contexts for neural machine translation via asynchronous bidirectional decoding. Artificial Intelligence, 277:103168, 2019.
116 Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, 2014.
117 Xin Tan, Longyin Zhang, Deyi Xiong, and Guodong Zhou. Hierarchical modeling of global context for document-level neural machine translation. In Proceedings of EMNLP-IJCNLP, pages 1576–1585, 2019.
118 Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, and Tie-Yan Liu. Multilingual neural machine translation with language clustering. In Proceedings of EMNLP-IJCNLP, 2019.
119 Jörg Tiedemann and Yves Scherrer. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, 2017.
120 Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420, 2018.
121 Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. In Proceedings of ACL, 2016.
122 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
123 Elena Voita, Rico Sennrich, and Ivan Titov. Context-aware monolingual repair for neural machine translation. In Proceedings of EMNLP-IJCNLP, 2019.
124 Elena Voita, Rico Sennrich, and Ivan Titov. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In Proceedings of ACL, 2019.
125 Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. Context-aware neural machine translation learns anaphora resolution. In Proceedings of ACL, 2018.
126 Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou. Bridging the gap between pre-training and fine-tuning for end-to-end speech translation. In Proceedings of AAAI, 2020.
127 Chunqi Wang, Ji Zhang, and Haiqing Chen. Semi-autoregressive neural machine translation. In Proceedings of EMNLP, 2018.
128 Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. Exploiting cross-sentence context for neural machine translation. In Proceedings of EMNLP, 2017.
129 Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. Learning deep transformer models for machine translation. In Proceedings of ACL, 2019.
130 Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. Instance weighting for neural machine translation domain adaptation. In Proceedings of EMNLP, pages 1482–1488, 2017.
131 Shuo Wang, Yang Liu, Chao Wang, Huanbo Luan, and Maosong Sun. Improving back-translation with uncertainty-based confidence estimation. In Proceedings of EMNLP, 2019.
132 Xing Wang, Zhaopeng Tu, Deyi Xiong, and Min Zhang. Translating phrases in neural machine translation. In Proceedings of EMNLP, 2017.
133 Xing Wang, Zhaopeng Tu, and Min Zhang. Incorporating statistical machine translation word knowledge into neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(12):2255–2266, 2018.
134 Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. Three strategies to improve one-to-many multilingual translation. In Proceedings of EMNLP, pages 2955–2960, 2018.
135 Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu, and Chengqing Zong. Synchronously generating two languages with interactive decoding. In Proceedings of EMNLP-IJCNLP, 2019.
136 Yining Wang, Long Zhou, Jiajun Zhang, Feifei Zhai, Jingfang Xu, Chengqing Zong, et al. A compact and language-sensitive multilingual translation method. In Proceedings of ACL, 2018.
137 Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In Proceedings of the AAAI, 2019.
138 Warren Weaver. Translation. Machine Translation of Languages, 14:15–23, 1955.
139 Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. Imitation learning for non-autoregressive neural machine translation. In Proceedings of ACL, 2019.
140 Ron J Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. Sequence-to-sequence models can directly translate foreign speech. In Proceedings of Interspeech, 2017.
141 Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In Proceedings of ICLR, 2019.
142 Hua Wu and Haifeng Wang. Pivot language approach for phrase-based statistical machine translation. Machine Translation, 21(3):165–181, 2007.
143 Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
144 Tong Xiao, Jingbo Zhu, Shujie Yao, and Hao Zhang. Document-level consistency verification in machine translation. Proceedings of the 13th Machine Translation Summit, 13:131–138, 2011.
145 Deyi Xiong, Yang Ding, Min Zhang, and Chew Lim Tan. Lexical chain based cohesion models for document-level statistical machine translation. In Proceedings of EMNLP, pages 1563–1573, 2013.
146 Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. Modeling coherence for discourse neural machine translation. In Proceedings of the AAAI, volume 33, pages 7338–7345, 2019.
147 Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Unsupervised neural machine translation with weight sharing. In Proceedings of ACL, 2018.
148 Zhengxin Yang, Jinchao Zhang, Fandong Meng, Shuhao Gu, Yang Feng, and Jie Zhou. Enhancing context modeling with a query-guided capsule network for document-level translation. 2019.
149 Jiali Zeng, Yang Liu, Jinsong Su, Yubing Ge, Yaojie Lu, Yongjing Yin, and Jiebo Luo. Iterative dual domain adaptation for neural machine translation. In Proceedings of EMNLP-IJCNLP, 2019.
150 Biao Zhang, Ivan Titov, and Rico Sennrich. Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of EMNLP-IJCNLP, 2019.
151 Biao Zhang, Deyi Xiong, Jinsong Su, and Jiebo Luo. Future-aware knowledge distillation for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):2278–2287, 2019.
152 Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of ACL, 2017.
153 Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. Improving the transformer translation model with document-level context. In Proceedings of EMNLP, 2018.
154 Jiajun Zhang, Long Zhou, Yang Zhao, and Chengqing Zong. Synchronous bidirectional inference for neural sequence generation. Artificial Intelligence, page 103234, 2020.
155 Jiajun Zhang, Chengqing Zong, et al. Learning a phrase-based translation model from monolingual data with application to domain adaptation. In Proceedings of ACL, 2013.
156 Jiajun Zhang, Chengqing Zong, et al. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems, 2015.
157 Jiajun Zhang, Chengqing Zong, et al. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP, 2016.
158 Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of ACL, pages 1959–1970, 2017.
159 Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of EMNLP, pages 1934–1945, 2017.
160 Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training and inference for neural machine translation. In Proceedings of ACL, pages 4334–4343, 2019.
161 Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of AAAI, 2018.
162 Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and Kevin Duh. Curriculum learning for domain adaptation in neural machine translation. In Proceedings of NAACL, 2019.
163 Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Joint training for neural machine translation models with monolingual data. In Thirty-Second AAAI, 2018.
164 Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Tong Xu. Regularizing neural machine translation by target-bidirectional agreement. In Proceedings of the AAAI, 2019.
165 Yang Zhao, Yining Wang, Jiajun Zhang, and Chengqing Zong. Phrase table as recommendation memory for neural machine translation. In Proceedings of EMNLP, 2018.
166 Yang Zhao, Jiajun Zhang, Zhongjun He, Chengqing Zong, and Hua Wu. Addressing troublesome words in neural machine translation. In Proceedings of EMNLP, pages 391–400, 2018.
167 Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of EMNLP, 2019.
168 Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of ACL, 2019.
169 Zaixiang Zheng, Shujian Huang, Zhaopeng Tu, Xin-Yu Dai, and Jiajun Chen. Dynamic past and future for neural machine translation. In Proceedings of EMNLP-IJCNLP, 2019.
170 Zaixiang Zheng, Hao Zhou, Shujian Huang, Lei Li, Xin-Yu Dai, and Jiajun Chen. Mirror-generative neural machine translation. In Proceedings of ICLR, 2020.
171 Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. Modeling past and future for neural machine translation. In TACL, volume 6, pages 145–157, 2018.
172 Long Zhou, Jiajun Zhang, and Chengqing Zong. Synchronous bidirectional neural machine translation. In TACL, 2019.
173 Long Zhou, Jiajun Zhang, Chengqing Zong, and Heng Yu. Sequence generation: From both sides to the middle. In Proceedings of IJCAI, 2019.
174 Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. Incorporating BERT into neural machine translation. In Proceedings of ICLR, 2020.
175 Chengqing Zong, Hua Wu, Taiyi Huang, and Bo Xu. Analysis on characteristics of Chinese spoken language. In Proc. of 5th Natural Language Processing Pacific Rim Symposium, pages 358–362, 1999.
176 Barret Zoph and Kevin Knight. Multi-source neural translation. In Proceedings of NAACL, 2016.
177 Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP, 2016.