0% found this document useful (0 votes)
61 views8 pages

Order-Planning Neural Text Generation From Structured Data

Uploaded by

karimhafez37
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views8 pages

Order-Planning Neural Text Generation From Structured Data

Uploaded by

karimhafez37
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Order-Planning Neural Text Generation From Structured Data

Lei Sha,1 Lili Mou,2 Tianyu Liu,1 Pascal Poupart,2 Sujian Li,1 Baobao Chang,1 Zhifang Sui1
1
Key Laboratory of Computational Linguistics, Ministry of Education; School of EECS, Peking Univeristy
2
David R. Cheriton School of Computer Science, University of Waterloo
1
{shalei, tianyu0421, lisujian, chbb, szf}@pku.edu.cn
2
[email protected], [email protected]

Table:
arXiv:1709.00155v1 [cs.CL] 1 Sep 2017

Abstract
ID Field Content
Generating texts from structured data (e.g., a table) is im- 1 Name Arthur Ignatius Conan Doyle
portant for various natural language processing tasks such as 2 Born 22 May 1859 Edinburgh, Scotland
question answering and dialog systems. In recent studies, re- 3 Died 7 July 1930 (aged 71) Crowborough, England
searchers use neural language models and encoder-decoder 4 Occupation Author, writer, physician
frameworks for table-to-text generation. However, these neu- 5 Nationality British
ral network-based approaches do not model the order of con- 6 Alma mater University of Edinburgh Medical School
tents during text generation. When a human writes a sum- 7 Genre Detective fiction fantasy
mary based on a given table, he or she would probably con- 8 Notable work Stories of Sherlock Homes
sider the content order before wording. In a biography, for
example, the nationality of a person is typically mentioned Text:
before occupation in a biography. In this paper, we propose Sir Arthur Ignatius Conan Doyle (22 May 1859 – 7 July 1930)
an order-planning text generation model to capture the rela- was a British writer best known for his detective fiction featuring
tionship between different fields and use such relationship to the character Sherlock Holmes.
make the generated text more fluent and smooth. We con-
ducted experiments on the W IKI B IO dataset and achieve sig-
Table 1: Example of a Wikipedia infobox and a reference
nificantly higher performance than previous methods in terms
of BLEU, ROUGE, and NIST scores. text.

Introduction This motivates order planning for neural text generation. In


Generating texts from structured data (e.g., a table) is im- other words, a neural network should model not only word
portant for various natural language processing tasks such order (as has been well captured by RNN) but also the order
as question answering and dialog systems. Table 1 shows an of contents, i.e., fields in a table.
example of a Wikipedia infobox (containing fields and val- We also observe from real summaries that table fields by
ues) and a text summary. themselves provide illuminating clues and constraints of text
In early years, text generation is usually accomplished by generation. In the biography domain, for example, the na-
human-designed rules and templates (Green 2006; Turner, tionality of a person is typically mentioned before the occu-
Sripada, and Reiter 2010), and hence the generated texts are pation. This could benefit from explicit planning of content
not flexible. Recently, researchers apply neural networks to order during neural text generation.
generate texts from structured data (Lebret, Grangier, and In this paper, we propose an order-planning method
Auli 2016), where a neural encoder captures table informa- for table-to-text generation. Our model is built upon the
tion and a recurrent neural network (RNN) decodes these encoder-decoder framework and use RNN for text synthe-
information to a natural language sentence. sis with attention to table entries. Different from exiting
Although such neural network-based approach is capable neural models, we design a table field linking mechanism,
of capturing complicated language and can be trained in an inspired by temporal memory linkage in the Differentiable
end-to-end fashion, it lacks explicit modeling of content or- Neural Computer (Graves et al. 2016, DNC). Our field link-
der during text generation. That is to say, an RNN gener- ing mechanism explicitly models the relationship between
ates a word at a time step conditioned on previous generated different fields, enabling our neural network to better plan
words as well as table information, which is more or less what to say first and what next. Further, we incorporate a
“shortsighted” and differs from how a human writer does. copy mechanism (Gu et al. 2016) into our model to cope
As suggested in the book The Elements of Style, with rare words.
A basic structural design underlies every kind of writ- We evaluated our method on the W IKI B IO dataset (Le-
ing . . . in most cases, planning must be a deliberate pre- bret, Grangier, and Auli 2016). Experimental results show
lude to writing. (William and White 1999) that our order-planning approach significantly outperforms
previous state-of-the-art results in terms of BLEU, ROUGE, (a) Encoder (b) Dispatcher
and NIST metrics. Extensive ablation tests verify the ef- Table Representation Planning What to Generate Next
fectiveness of each component in our model; we also per- Field Content
form visualization analysis to better understand the proposed Name Arthur
order-planning mechanism. Name Ignatius Content-based
Name Conan attention

Approach Name Doyle

LSTM
Born 22
Our model takes as input a table (e.g., a Wikipedia infobox) Weighted
Born May sum
and generates a natural language summary describing the
Born 1859
information based on an RNN. The neural network contains
Occupation writer
three main components: Attention
Occupation physician vector
• An encoder captures table information; Nationality British
• A dispatcher—a hybrid content- and linkage-based atten- Hybrid attention
tion mechanism over table contents—plans what to gen-
erate next; and
• A decoder generates a natural language summary using
RNN, where we also incorporate a copy mechanism (Gu Last step's
attention Link (sub)matrix
et al. 2016) to cope with rare words.
We elaborate these components in the rest of this section. Link-based
attention

Encoder: Table Representation


Figure 1: The (a) Encoder and (b) Dispatcher in our model.
We design a neural encoder to represent table information.
As shown in Figure 1, the content of each field is split
into separate words and the entire table is transformed into
a large sequence. Then we use a recurrent neural network
(RNN) with long short term memory (LSTM) units (Hochre- Conan Doyle”), whereas other fields contain a set of words
iter and Schmidhuber 1997) to read the contents as well as (e.g., Occupation = “writer, physician”). We do not have
their corresponding field names. much human engineering here, but let an RNN to capture
Concretely, let C be the number of content words in a such subtlety by itself.
table; let ci and fi be the embeddings of a content and its
corresponding field, respectively (i = 1 · · · C). The input of
LSTM-RNN is the concatenation of fi and ci , denoted as Dispatcher: Planning What to Generate Next
xi = [fi ; ci ], and the output, denoted as hi , is the encoded
information corresponding to a content word, i.e., After encoding table information, we use another RNN to
 
gin ; gforget ; gout = σ(Wg xi + Ug hi−1 ) (1) decode a natural language summary (deferred to the next
part). During the decoding process, the RNN is augmented
x
ei = tanh(Wx xi + Ux hi−1 ) (2) with a dispatcher that plans what to generate next.
hi = gin ◦ x
e ei + gforget ◦ h
e i−1 (3) Generally, a dispatcher is an attention mechanism over ta-
ble contents. At each decoding time step t, the dispatcher
hi = gout ◦ tanh(h
e i) (4)
computes a probabilistic distribution αt,i (i = 1 · · · C),
where ◦ denotes element-wise product, and σ denotes the which is further used for weighting content representa-
sigmoid function. W ’s and U ’s are weights, and bias terms tions hi . In our model, the dispatcher is a hybrid of content-
are omitted in the equations for clarity. gin , gforget , and gout and link-based attention, discussed in detail as follows.
are known as input, forget, and output gates.
Notice that, we have two separate embedding matrices for
fields and content words. We observe the field names of dif- Content-Based Attention. Traditionally, the computation
ferent data samples mostly come from a fixed set of candi- of attention αt,i is based on the content representation hi
dates, which is reasonable in a particular domain. Therefore, as well as some state during decoding (Bahdanau, Cho, and
we assign an embedding to a field, regardless of the number Bengio 2015; Mei, Bansal, and Walter 2016). We call this
of words in the field name. For example, the field Notable content-based attention, which is also one component in our
work in Table 1 is represented by a single field embedding dispatcher.
instead of the embeddings of notable and work. Since both the field name and the content contain im-
For content words, we represent them with conventional portant clues for text generation, we compute the attention
word embeddings (which are randomly initialized), and use weights based on not only the encoded vector of table con-
LSTM-RNN to integrate information. In a table, some fields tent hi but also the field embedding fi , thus obtaining the
content
contain a sequence of words (e.g., Name=“Arthur Ignatius final attention αt,i by re-weighting one with the other.
Formally, we have content words. Therefore, we do not require our link matrix
(f )
L to be a probabilistic distribution in each row, but normal-
et,i = fi> W (f ) yt−1 + b(f )

α (5) ize the probability afterwards by Equation 9, which turns out
(c) to work well empirically.
et,i = h> (c)
yt−1 + b(c)

α i W (6) Besides, we would like to point out that the link-based
 (f ) (c)
exp α et,i α
et,i attention is inspired by the Differentiable Neural Com-
content
αt,i = PC  (f ) (c) (7) puter (Graves et al. 2016, DNC). DNC contains a “linkage-
j=1 exp α et,j α
et,j based addressing” mechanism to track consecutively used
memory slots and thus to integrate order information during
where W (f ) , b(f ) , W (c) , b(c) are learnable parameters; fi memory addressing. Likewise, we design the link-based at-
and hi are vector representations of the field name and en- tention to capture the temporal order of different fields. But
content
coded content, respectively, for the ith row. αt,i is the different from the linking strength heuristically defined in
content-based attention weights. Ideally, a larger content- DNC, the link matrix in our model is directly parameterized
based attention indicates a more relevant content to the last and trained in an end-to-end manner.
generated word.
Hybrid Attention. To combine the above two attention
Link-Based Attention. We further propose a link-based mechanisms, we use a self-adaptive gate zt ∈ (0, 1) by a
attention mechanism that directly models the relationship sigmoid unit
between different fields.
(f )
zt = σ w> [h0t−1 ; et ; yt−1 ]

Our intuition stems from the observation that, a well- (10)
organized text typically has a reasonable order of its con-
tents. As illustrated previously, the nationality of a person where w is a parameter vector. h0t−1 is the last step’s hid-
is often mentioned before his occupation (e.g., a British den state of the decoder RNN. yt−1 is the embedding of
(f )
writer). Therefore, we propose an link-based attention to ex- the word generated in the last step; et is the sum of field
plicitly model such order information. embeddings fi weighted by the current step’s field attention
We construct a link matrix L ∈ Rnf ×nf , where nf is link
αt,i
(f )
. As yt−1 and et emphasize the content and link as-
the number possible field names in the dataset. An element pects, respectively, the self-adaptive gate z is aware of both.
L [fj , fi ] is a real-valued score indicating how likely the field In practice, we find z tends to address link-based attention
fj is mentioned after the field fi . (Here, [·, ·] indexes a ma- too much and thus adjust it by zet = 0.2zt + 0.5 empirically.
trix.) The link matrix L is a part of model parameters and Finally, the hybrid attention, a probabilistic distribution
learned by backpropagation. Although the link matrix ap- over all content words, is given by
pears to be large in size (1475×1475), a large number of its
elements are not used because most fields do not co-occur αhybrid
t = zet · αcontent
t + (1 − zet ) · αlink
t (11)
in at least one data sample; in total, we have 53422 effective
parameters here. In other scenarios, low-rank approximation Decoder: Sentence Generation
may be used to reduce the number of parameters. The decoder is an LSTM-RNN that predicts target words in
Formally, let αt−1,i (i = 1 . . . C) be an attention proba- sequence. We also have an attention mechanism (Bahdanau,
bility1 over table contents in the last time step during gener- Cho, and Bengio 2015) that summarizes source information,
ation. For a particular data sample whose content words are i.e., the table in our scenario, by weighted sum, yielding an
of fields f1 , f2 , · · · , fC , we first weight the linking scores attention vector at by
by the previous attention probability, and then normalize C
the weighted score to obtain link-based attention probabil- X hybrid
ity, given by at = αt,i hi (12)
i=1
X C 
where hi is the hidden representation obtained by the
link
αt,i = softmax αt−1,j · L [fj , fi ] (8) hybrid
j=1
table encoder. As αt,i is a probabilistic distribution—
 PC determined by both content and link information—over con-
exp αt−1,j · L [fj , fi0 ]
j=1 tent words, it enables the decoder RNN to focus on relevant
= PC (9) information at a time, serving as an order-planning mecha-
j αt−1,j · L [fj , fi ]
P
i0 =1 exp
0
nism for table-to-text generation.
Intuitively, the link matrix is analogous to the transition Then we concatenate the attention vector at and the em-
matrix in a Markov chain (Karlin 2014), whereas the term bedding of the last step’s generated word yt−1 , and use a
PC single-layer neural network to mix information before feed-
j=1 αt−1,j · L [fj , fi ] is similar to one step of transition ing to the decoder RNN. In other words, the decoder RNN’s
in the Markov chain. However, in our scenario, a table in a
input (denoted as xt ) is
particular data sample contains only a few fields, but a field
may occur several times because it contains more than one xt = tanh(Wd [at ; yt−1 ] + bd ) (13)
1
Here, αt−1,i refers to the hybrid content- and link-based atten- where Wd and bd are weights. Similar to Equations 1–4, at
tion, which will be introduced shortly. a time step t during decoding, the decoder RNN yields a
Decoder where 1{ci =w} is a Boolean variable indicating whether the
Sentence Generation content word ci is the same as the word w we are consider-
<eos> ing.
Finally, the LSTM score and the copy score are added for
a particular word and further normalized to obtain a proba-
LSTM LSTM . . . LSTM . . . LSTM bilistic distribution, given by
st (w) = sLSTM
t (w) + scopy
t (w) (17)
<start>
exp{st (w)}
pt (w) = softmax (st (w)) = (18)
exp{st (w0 )}
P
Table content w0 ∈V
S
C

where V refers to the vocabulary list and C refers to the set


of content words in a particular data sample. In this way,
the copy mechanism can either generate a word from the
vocabulary or directly copy a word from the source side.
This is hepful in our scenario because some fields in a table
(e.g., Name) may contain rare or unseen words and the copy
LSTM mechanism can cope with them naturally.
Last
LSTM For simplicity, we use greedy search during inference, i.e.,
state for each time step t, the word with the largest probability is
chosen, given by yt = argmaxw pt (w). The decoding pro-
cess terminates when a special symbol <eos> is generated,
indicating the end of a sequence.
Embedding of the Attention
generated word vector Training Objective
in last step
Our training objective is the negative log-likelihood of a sen-
tence y1 · · · yT in the training set.
Figure 2: The decoder RNN in our model, which is enhanced
with a copy mechanism. T
X
J =− log p(yt |y0 · · · yt−1 ) (19)
t=1

hidden representation h0t , based on which a score function where p(yt |·) is computed by Equation 18. An `2 penalty is
sLSTM is computed suggesting the next word to generate. also added as most other studies.
The score function is computed by Since all the components described above are differen-
tiable, our entire model can be trained end-to-end by back-
sLSTM
t = Ws h0t + bs (14)
propagation, and we use Adam (Kingma and Ba 2015) for
where h0t is the decoder RNN’s state. (Ws and bs are optimization.
weights.) The score function can be thought of as the input
of a softmax layer for classification before being normalized Experiments
to a probabilistic distribution. We incorporate a copy mech-
anism (Gu et al. 2016) into our approach, and the normal-
Dataset
ization is accomplished after considering a copying score, We used the newly published W IKI B IO dataset (Lebret,
introduced as follows. Grangier, and Auli 2016),2 which contains 728,321 biogra-
phies from WikiProject Biography3 (originally from English
Copy Mechanism. The copy mechanism scores a content Wikipedia, September 2015).
word ci by its hidden representation hi in the encoder side, Each data sample comprises an infobox table of field-
indicating how likely the content word ci is directly copied content pairs, being the input of our system. The generation
during target generation. That is, target is the first sentence in the biography, which follows the
st,i = σ(h> 0
i Wc )ht (15) setting in previous work (Lebret, Grangier, and Auli 2016).
Although only the first sentence is considered in the experi-
and st,i is a real number for i = 1, · · · , C (the number of ment, the sentence typically serves as a summary of the ar-
content words). Here Wc is a parameter matrix, and h0 is the ticle. In fact, the target sentence has 26.1 tokens on average,
decoding state. which is actually long. Also, the sentence contains informa-
In other words, when a word appears in the table content, tion spanning multiple fields, and hence our order-planning
it has a copying score computed as above. If a word w occurs mechanism is useful in this scenario.
multiple times in the table contents, the scores are added,
2
given by https://github.com/DavidGrangier/
C wikipedia-biography-dataset
st,i · 1{ci =w}
3
X
scopy
t (w) = (16) https://en.wikipedia.org/wiki/Wikipedia:
i=1
WikiProject_Biography
Group Model BLEU ROUGE NIST
Previous results KN 2.21 0.38 0.93
Template KN 19.80 10.70 5.19
Table NLMl 34.70 25.80 7.98
Our results Content attention only 41.38 34.65 8.57
Order planning (full model) 43.91 37.15 8.85

Table 2: Comparison of the overall performance between our model and previous methods. l Best results in Lebret, Grangier,
and Auli (2016).

We applied the standard data split: 80% for training and Component BLEU ROUGE NIST
10% for testing, except that model selection was performed
on a validaton subset of 1000 samples (based on BLEU-4). Content att. 41.38 34.65 8.57
Link att. 38.24 32.75 8.36
Hybrid att. 43.01 36.91 8.75
Settings
Copy+Content att. 41.89 34.93 8.63
We decapitalized all words and kept a vocabulary size of Copy+Link att. 39.08 33.47 8.42
20,000 for content words and generation candidates, which Copy+Hybrid att. 43.91 37.15 8.85
also followed previous work (Lebret, Grangier, and Auli
2016). Even with this reasonably large vocabulary size, we Table 3: Ablation test.
had more than 900k out-of-vocabulary words. This rational-
izes the use of the copy mechanism.
For the names of table fields, we treated each as a spe- and Auli (2016) because the copy mechanism makes the vo-
cial token. By removing nonsensical fields whose content is cabulary size vary among data samples, and thus the per-
“none” and grouping fields occurring less than 100 times as plexity is not comparable among different approaches.
an “Unknown” field, we had 1475 different field names in
total. Results
In our experiments, both words’ and table fields’ embed- Overall Performance. Table 2 compares the overall per-
dings were 400-dimensional and LSTM layers were 500- formance with previous work. We see that, modern neural
dimensional. Notice that, a field (e.g., “name”) and a con- networks are considerably better than traditional KN models
tent/generation word (e.g., also “name”), even with the same with or without templates. Moreover, our base model (with
string, were considered as different tokens; hence, they had content-attention only) outperforms Lebret, Grangier, and
different embeddings. We randomly initialized all embed- Auli (2016), showing our better engineering efforts. After
dings, which are tuned during training. adding up all proposed components, we obtain +2.5 BLEU
We used Adam (Kingma and Ba 2015) as the optimiza- and ROUGE improvement and +0.3 NIST improvement,
tion algorithm with a batch size of 32; other hyperparame- achieving new state-of-the-art results.
ters were set to default values.
Ablation Test. Table 3 provides an extensive ablation test
Baselines to verify the effectiveness of each component in our model.
The top half of the table shows the results without the copy
We compared our model with previous results using either mechanism, and the bottom half incorporates the copying
traditional language models or neural networks. score as described previously. We observe that the copy
mechasnim is consistently effective with different types of
• KN and Template KN (Heafield et al. 2013): Lebret,
attention.
Grangier, and Auli (2016) train an interpolated Kneser-
We then compare content-based attention and link-based
Ney (KN) language model for comparison with the
attention, as well as their hybrid (also Table 3). The results
KenLM toolkit. They also train a KN language model
show that, link-based attention alone is not as effective as
with templates.
content-based attention. However, we achieve better perfor-
• Table NLM: Lebret, Grangier, and Auli (2016) propose an mance if combining them together with an adaptive gate,
RNN-based model with attention and copy mechanisms. i.e., the proposed hybrid attention. The results are consistent
They have several model variants, and we quote the high- in both halves of Table 3 (with or without copying) and in
est reported results. terms of all metrics (BLEU, ROUGE, and NIST). This im-
plies that content-based attention and link-based attention do
We report model performance in terms of several met- capture different aspects of information, and their hybrid is
rics, namely BLEU-4, ROUGE-4, and NIST-4, which are more suited to the task of table-to-text generation.
computed by standard software, NIST mteval-v13a.pl (for
BLEU and NIST) and MSR rouge-1.5.5 (for ROUGE). We Effect of the gate. We are further interested in the effect
did not include the perplexity measure in Lebret, Grangier, of the gate z, which balances content-based attention αcontent
50.0 Feeding field info to. . . BLEU ROUGE NIST
None 41.89 34.93 8.63
45.0
Computation of αcontent 40.52 34.95 8.57
Decoder RNN’s input 41.96 35.07 8.61
BLEU

40.0
Hybrid att. (proposed) 43.91 37.15 8.85
35.0 Fixed z Table 4: Comparing different possible ways of using field
Self-adaptive z information. “None”: No field information is fed back to
30.0 the network, i.e., content-based attention computed by Equa-
0.0 0.2 0.4 0.6 0.8 1.0
z tion 7 (with copying).
Figure 3: Comparing the self-adaptive gate with interpola-
tion of content- and link-based attention. z = 0 is link-based
attention, z = 1 is content-based attention. word American is appropriate in the sentence, and corrupts
the phrase former governor of the federal reserve system as
appears in the reference. However, when link-based atten-
and link-based attention αlink . As defined in Equation 11, tion is added, the network is more aware of the order be-
the computation of z depends on the decoding state as well tween fields “Nationality” and “Occupation,” and generates
as table information; hence it is “self-adaptive.” We would the nationality American before the occupation economist.
like to verify if such adaptiveness is useful. To verify this, This process could also be visualized in Figure 4. Here, we
we designed a controlled experiment where the gate z was plot our model’s content-based attention, link-based atten-
manually assigned in advance and fixed during training. In tion and their hybrid. (The content- and link-based attention
other words, the setting was essentially a (fixed) interpola- probabilities may be different from those separately trained
tion between αcontent and αlink . Specifically, we tuned z from in the ablation test.) After generating “emmett john rice (
0 to 1 with a granularity of 0.1, and plot BLEU scores as the december 21, 1919 – march 10, 2011 ) was,” content-based
comparison metric in Figure 3. attention skips the nationality and focuses more on the oc-
As seen, interpolation of content- and link-based atten- cupation. Link-based attention, on the other hand, provides
tion is generally better than either single mechanism, which a strong clue suggesting to generate the nationality first and
again shows the effectiveness of hybrid attention. However, then occupation. In this way, the obtained sentence is more
the peak performance of simple interpolation (42.89 BLEU compliant with conventions.
when z = 0.4) is worse than the self-adaptive gate, imply-
ing that our gating mechanism can automatically adjust the Related Work
importance of αcontent and αlink at a particular time based on
the current state and input. Text generation has long aroused interest in the NLP com-
munity due to is wide applications including automated nav-
Different Ways of Using Field Information. We are cu- igation (Dale, Geldof, and Prost 2003) and weather forecast-
rious whether the proposed order-planning mechanism is ing (Reiter et al. 2005). Traditionally, text generation can be
better than other possible ways of using field information. divided into several steps (Stent, Prassad, and Walker 2004):
We conducted two controlled experiments as follows. Sim- content planning defines what information should be con-
ilar to the proposed approach, we multiplied the attention veyed in the generated sentence; (2) sentence planning de-
probability by a field matrix and thus obtained a weighted termines what to generate in each sentence; and (3) surface
field embedding. We fed it to either (1) the computation of realization actually generates those sentences with words.
content-based attention, i.e., Equations 5–6, or (2) the RNN In early years, surface realization is often accomplished
decoder’s input, i.e., Equation 13. In both cases, the last by templates (Van Deemter, Theune, and Krahmer 2005)
step’s weighted field embedding was concatenated with the or statistically learned (shallow) models, e.g., probabilis-
embedding of the generated word yt−1 . tic context-free grammar (Belz 2008) and language mod-
From Table 4, we see that feeding field information to the els (Angeli, Liang, and Klein 2010), with hand-crafted fea-
computation of αcontent interferes content attention and leads tures or rules. Therefore, these methods are weak in terms of
to performance degradation, and that feeding it to decoder the quality of generated sentences. For planning, researchers
RNN slightly improves model performance. However, both also apply (shallow) machine learning approaches. Barzilay
controlled experiments are worse than the proposed method. and Lapata (2005), for example, model it as a collective clas-
The results confirm that our order-planning mechanism is sification problem, whereas Liang, Jordan, and Klein (2009)
indeed useful in modeling the order of fields, outperforming use a generative semi-Markov model to align text segment
several other approaches that use the same field information and assigned meanings. Generally, planning and realization
in a naı̈ve fashion. in the above work are separate and have difficulty in captur-
ing the complexity of language due to the nature of shallow
Case Study and Visualization models.
We showcase an example in Table 5. With only content- Recently, the recurrent neural network (RNN) is playing
based attention, the network is confused about when the a key role in natural language generating. As RNN can au-
Name Emmett John Rice Reference emmett john rice ( december 21 , 1919 – march 10 , 2011 ) was a
Birth date December 21, 1919 former governor of the federal reserve system , a Cornell university
Birth place Florence, South Carolina, economics professor , expert in the monetary systems of developing
United States
Death date March 10, 2011 (aged 91)
countries and the father of the current national security advisor to
Death place Camas, Washington, president barack obama , susan e . rice .
United States
Nationality American
Content-based emmett john rice ( december 21 , 1919 – march 10 , 2011 ) was an
attention economist , author , public official and the former american governor
Occupation Governor of the Federal
Reserve System, of the federal reserve system , the first african american UNK .
Economics Professor
Hybrid attention emmett john rice ( december 21 , 1919 – march 10 , 2011 ) was an
Known for Expert in the Monetary
System of Developing
american economist , author , public official and the former governor
Countries, Father to of the federal reserve system , expert in the monetary systems of
Susan E. Rice developing countries .

Table 5: Case study. Left: Wikipedia infobox. Right: A reference and two generated sentences by different attention (both with
the copy mechanism).

(a) αcontent (b) αlink (c) αhybrid Our paper proposes an order-planning approach by de-
death place:united signing a hybrid of content- and link-based attention. The
death place:states model is inspired by hybrid content- and location-based ad-
nationality:american dressing in the Differentiable Neural Computer (Graves et
occupation:governor al. 2016, DNC), where the location-based addressing is de-
occupation:of
fined heuristically. Instead, we propose a transition-like link
occupation:the
occupation:federal
matrix that models how likely a field is mentioned after an-
occupation:reserve other, which is more suited in our scenario.
occupation:system Moreover, our entire model is differentiable, and thus the
occupation:, planning and realization steps in traditional language gener-
occupation:economics ation can be learned end-to-end in our model.
occupation:professor
known for:expert
Conclusion and Future Work
)
was
an

)
was
an

)
was
an
american
economist

american
economist

american
economist

In this paper, we propose an order-planning neural network


that generates texts from a table (Wikipedia infobox). The
text generation process is built upon an RNN with attention
Figure 4: Visualization of attention probabilities in our to table contents. Different from traditional content-based
model. x-axis: generated words “. . . ) was an american attention, we explicitly model the order of contents by a
economist . . . ”; y-axis: hfield : content wordi pairs in the link matrix, based on which we compute a link-based at-
table. (a) Content-based attention. (b) Link-based attention. tention. Then a self-adaptive gate balances the content- and
(c) Hybrid attention. Subplot (b) exhibits strips because, by link-based attention mechanisms. We further incorporate a
definition, link-based attention will yield the same score for copy mechanism to our model to cope with rare or unseen
all content words with the same field. Please also note that words.
the columns do not sum to 1 in the figure because we only We evaluated our approach on a newly proposed large
plot a part of the attention probabilities. scale dataset, W IKI B IO. Experimental results show that we
outperform previous results by a large margin in terms of
tomatically capture highly complicated patterns during end- BLEU, ROUGE, and NIST scores. We also had extensive
to-end training, it has successful applications including ma- ablation test showing the effectiveness of the copy mecha-
chine translation (Bahdanau, Cho, and Bengio 2015), dia- nism, as well as the hybrid attention of content and linking
log systems (Shang, Lu, and Li 2015), and text summariza- information. We compared our order-planning mechanism
tion (Tan, Wan, and Xiao 2017). with other possible ways of modeling field; the results con-
Researchers are then beginning to use RNN for text gener- firm that the proposed method is better than feeding field
ation from structured data. Mei, Bansal, and Walter (2016) embedding to the network in a naı̈ve fashion. Finally we pro-
propose a coarse-to-fine grained attention mechanism that vide a case study and visualize the attention scores so as to
selects one or more records (e.g., a piece of weather forecast) better understand our model.
by a precomputed but fixed probability and then dynamically In future work, we would like to deal with text genera-
attends to relevant contents during decoding. Lebret, Grang- tion from multiple tables. In particular, we would design hi-
ier, and Auli (2016) incorporate the copy mechanism (Gu erarchical attention mechanisms that can first select a table
et al. 2016) into the generation process. However, the above containing the information and then select a field for gen-
approaches do not explicitly model the order of contents. It eration, which would improve the attention efficiency. We
is also nontrivial to combine traditional planning techniques would also like to apply the proposed method to text gener-
to such end-to-end learned RNN. ation from other structured data, e.g., a knowledge graph.
Acknowledgments [Liang, Jordan, and Klein 2009] Liang, P.; Jordan, M. I.; and
We thank Jing He from AdeptMind.ai for helpful discus- Klein, D. 2009. Learning semantic correspondences with
sions on different ways of using field information. less supervision. In Proceedings of the Joint Conference of
the 47th Annual Meeting of the ACL and the 4th Interna-
References tional Joint Conference on Natural Language Processing of
the AFNLP, 91–99.
[Angeli, Liang, and Klein 2010] Angeli, G.; Liang, P.; and
Klein, D. 2010. A simple domain-independent probabilistic [Mei, Bansal, and Walter 2016] Mei, H.; Bansal, M.; and
approach to generation. In Proceedings of the Conference on Walter, M. R. 2016. What to talk about and how? Selec-
Empirical Methods in Natural Language Processing, 502– tive generation using LSTMs with coarse-to-fine alignment.
512. In Proceedings of the Conference of the North American
Chapter of the Association for Computational Linguistics:
[Bahdanau, Cho, and Bengio 2015] Bahdanau, D.; Cho, K.;
Human Language Technologies, 720–730.
and Bengio, Y. 2015. Neural machine translation by jointly
learning to align and translate. In Proceedings of the Inter- [Reiter et al. 2005] Reiter, E.; Sripada, S.; Hunter, J.; Yu, J.;
national Conference on Learning Representations. and Davy, I. 2005. Choosing words in computer-generated
weather forecasts. Artificial Intelligence 167(1-2):137–169.
[Barzilay and Lapata 2005] Barzilay, R., and Lapata, M.
2005. Collective content selection for concept-to-text gener- [Shang, Lu, and Li 2015] Shang, L.; Lu, Z.; and Li, H. 2015.
ation. In Proceedings of Human Language Technology Con- Neural responding machine for short-text conversation. In
ference and Conference on Empirical Methods in Natural Proceedings of the 53rd Annual Meeting of the Associa-
Language Processing, 331–338. tion for Computational Linguistics and the 7th International
[Belz 2008] Belz, A. 2008. Automatic generation of weather Joint Conference on Natural Language Processing, 1577–
forecast texts using comprehensive probabilistic generation- 1586.
space models. Natural Language Engineering 14(4):431– [Stent, Prassad, and Walker 2004] Stent, A.; Prassad, R.; and
455. Walker, M. 2004. Trainable sentence planning for complex
[Dale, Geldof, and Prost 2003] Dale, R.; Geldof, S.; and information presentations in spoken dialog systems. In Pro-
Prost, J.-P. 2003. CORAL: Using natural language genera- ceedings of the 42nd Meeting of the Association for Compu-
tion for navigational assistance. In Proceedings of the 26th tational Linguistics, 79–86.
Australasian Computer Science Conference, volume 16, 35– [Tan, Wan, and Xiao 2017] Tan, J.; Wan, X.; and Xiao, J.
44. 2017. Abstractive document summarization with a graph-
[Graves et al. 2016] Graves, A.; Wayne, G.; Reynolds, M.; based attentional neural model. In Proceedings of the 55th
et al. 2016. Hybrid computing using a neural network with Annual Meeting of the Association for Computational Lin-
dynamic external memory. Nature 538(7626):471–476. guistics, 1171–1181.
[Green 2006] Green, N. 2006. Generation of biomedical ar- [Turner, Sripada, and Reiter 2010] Turner, R.; Sripada, S.;
guments for lay readers. In Proceedings of the 4th Interna- and Reiter, E. 2010. Generating approximate geographic
tional Natural Language Generation Conference, 114–121. descriptions. In Empirical Methods in Natural Language
Generation, 121–140.
[Gu et al. 2016] Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016.
Incorporating copying mechanism in sequence-to-sequence [Van Deemter, Theune, and Krahmer 2005] Van Deemter,
learning. In Proceedings of the 54th Annual Meeting of the K.; Theune, M.; and Krahmer, E. 2005. Real versus
Association for Computational Linguistics, 1631–1640. template-based natural language generation: A false
opposition? Computational Linguistics 31(1):15–24.
[Heafield et al. 2013] Heafield, K.; Pouzyrevsky, I.; Clark,
J. H.; and Koehn, P. 2013. Scalable modified Kneser-Ney [William and White 1999] William, S., and White, E. B.
language model estimation. In Proceedings of the 51st An- 1999. The Element of Style. Pearson, 4th edition.
nual Meeting of the Association for Computational Linguis-
tics, volume 2, 690–696.
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and
Schmidhuber, J. 1997. Long short-term memory. Neural
Computation 9(8):1735–1780.
[Karlin 2014] Karlin, S. 2014. A First Course in Stochastic
Processes. Academic Press.
[Kingma and Ba 2015] Kingma, D., and Ba, J. 2015. Adam:
A method for stochastic optimization. In Proceedings of the
International Conference on Learning Representations.
[Lebret, Grangier, and Auli 2016] Lebret, R.; Grangier, D.;
and Auli, M. 2016. Neural text generation from structured
data with application to the biography domain. In Proceed-
ings of the Conference on Empirical Methods in Natural
Language Processing, 1203–1213.

You might also like