Order-Planning Neural Text Generation From Structured Data

Lei Sha,1 Lili Mou,2 Tianyu Liu,1 Pascal Poupart,2 Sujian Li,1 Baobao Chang,1 Zhifang Sui1
1 Key Laboratory of Computational Linguistics, Ministry of Education; School of EECS, Peking University
2 David R. Cheriton School of Computer Science, University of Waterloo
1 {shalei, tianyu0421, lisujian, chbb, szf}@pku.edu.cn
2 [email protected], [email protected]
Abstract

Generating texts from structured data (e.g., a table) is important for various natural language processing tasks such as question answering and dialog systems. In recent studies, researchers use neural language models and encoder-decoder frameworks for table-to-text generation. However, these neural network-based approaches do not model the order of contents during text generation. When a human writes a summary based on a given table, he or she would probably consider the content order before wording. In a biography, for example, the nationality of a person is typically mentioned before the occupation. In this paper, we propose an order-planning text generation model to capture the relationship between different fields and use such relationships to make the generated text more fluent and smooth. We conducted experiments on the WIKIBIO dataset and achieved significantly higher performance than previous methods in terms of BLEU, ROUGE, and NIST scores.

Table:
ID  Field         Content
1   Name          Arthur Ignatius Conan Doyle
2   Born          22 May 1859 Edinburgh, Scotland
3   Died          7 July 1930 (aged 71) Crowborough, England
4   Occupation    Author, writer, physician
5   Nationality   British
6   Alma mater    University of Edinburgh Medical School
7   Genre         Detective fiction, fantasy
8   Notable work  Stories of Sherlock Holmes

Text:
Sir Arthur Ignatius Conan Doyle (22 May 1859 – 7 July 1930) was a British writer best known for his detective fiction featuring the character Sherlock Holmes.

Table 1: Example of a Wikipedia infobox and a reference text.
Our model takes as input a table (e.g., a Wikipedia infobox) and generates a natural language summary describing the information based on an RNN. The neural network contains three main components:

• An encoder captures table information;
• A dispatcher—a hybrid content- and linkage-based attention mechanism over table contents—plans what to generate next; and
• A decoder generates a natural language summary using RNN, where we also incorporate a copy mechanism (Gu et al. 2016) to cope with rare words.

We elaborate these components in the rest of this section; a minimal sketch of how they fit together is given below.

[Figure: overview of the proposed model. Table cells (e.g., Born: 22 May 1859; Occupation: writer, physician; Nationality: British) are encoded and attended by a hybrid of content-based attention and link-based attention (the latter driven by the last step's attention and a link (sub)matrix); the weighted sum forms the attention vector fed to the LSTM decoder.]
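As a concrete (and deliberately simplified) illustration of how these three components could fit together, the following Python sketch uses plain NumPy; all function names, shapes, and the stand-in update rules are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def encode_table(field_emb, content_emb):
    # Encoder (sketch): one vector per table cell, built from the cell's
    # field embedding and content-word embedding.
    return np.tanh(np.concatenate([field_emb, content_emb], axis=-1))

def dispatch(cell_repr, prev_attention, link_matrix, decoder_state, gate_z):
    # Dispatcher (sketch): hybrid attention over table cells, mixing a
    # content-based score (cells vs. the decoding state) with a link-based
    # score derived from the last step's attention and a link (sub)matrix.
    scores = cell_repr @ decoder_state
    alpha_content = np.exp(scores - scores.max())
    alpha_content /= alpha_content.sum()
    alpha_link = prev_attention @ link_matrix
    alpha_link /= alpha_link.sum() + 1e-9
    return gate_z * alpha_content + (1.0 - gate_z) * alpha_link

def decode_step(attention, cell_repr, decoder_state):
    # Decoder (sketch): consume the attention-weighted table context and
    # update the state (a stand-in update rule, not a real LSTM with copying).
    context = attention @ cell_repr
    return np.tanh(decoder_state + context)
```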
hidden representation h'_t, based on which a score function s^LSTM_t is computed, suggesting the next word to generate. The score function is computed by

    s^{\mathrm{LSTM}}_t = W_s h'_t + b_s                                  (14)

where h'_t is the decoder RNN's state. (W_s and b_s are weights.) The score function can be thought of as the input of a softmax layer for classification before being normalized to a probability distribution. We incorporate a copy mechanism (Gu et al. 2016) into our approach, and the normalization is accomplished after considering a copying score, introduced as follows.

Copy Mechanism. The copy mechanism scores a content word c_i by its hidden representation h_i on the encoder side, indicating how likely the content word c_i is directly copied during target generation. That is,

    s_{t,i} = \sigma(h_i^\top W_c)\, h'_t                                 (15)

and s_{t,i} is a real number for i = 1, ..., C (the number of content words). Here W_c is a parameter matrix, and h'_t is the decoding state.

In other words, when a word appears in the table content, it has a copying score computed as above. If a word w occurs multiple times in the table contents, the scores are added, given by

    s^{\mathrm{copy}}_t(w) = \sum_{i=1}^{C} s_{t,i} \cdot \mathbb{1}\{c_i = w\}   (16)
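As a concrete illustration of Equations 14–16, the NumPy sketch below computes the generation score for every vocabulary word and the aggregated copying score for every distinct content word; the variable names and shapes are our own assumptions rather than the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generation_scores(h_dec, W_s, b_s):
    # Equation (14): s_t^LSTM = W_s h'_t + b_s, one score per vocabulary word.
    return W_s @ h_dec + b_s

def copy_scores(h_enc, h_dec, W_c, content_words):
    # Equation (15): s_{t,i} = sigma(h_i^T W_c) h'_t, one score per content word.
    s_ti = sigmoid(h_enc @ W_c) @ h_dec
    # Equation (16): occurrences of the same word w have their scores added.
    totals = {}
    for word, score in zip(content_words, s_ti):
        totals[word] = totals.get(word, 0.0) + float(score)
    return totals
```

In the full model, the generation score and the copying score are combined and normalized into the output distribution p(y_t | ·) (the paper's Equation 18, which is not part of this excerpt).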
where p(y_t | ·) is computed by Equation 18. An ℓ2 penalty is also added, as in most other studies.

Since all the components described above are differentiable, our entire model can be trained end-to-end by backpropagation, and we use Adam (Kingma and Ba 2015) for optimization.

Experiments

Dataset

We used the newly published WIKIBIO dataset (Lebret, Grangier, and Auli 2016; https://github.com/DavidGrangier/wikipedia-biography-dataset), which contains 728,321 biographies from WikiProject Biography (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Biography), originally from English Wikipedia (September 2015).

Each data sample comprises an infobox table of field-content pairs, which is the input of our system. The generation target is the first sentence in the biography, following the setting in previous work (Lebret, Grangier, and Auli 2016). Although only the first sentence is considered in the experiment, it typically serves as a summary of the article. In fact, the target sentence has 26.1 tokens on average, which is fairly long. Also, the sentence contains information spanning multiple fields, and hence our order-planning mechanism is useful in this scenario.
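For concreteness, one WIKIBIO-style training sample can be viewed as a list of field-content pairs plus the article's first sentence as the target; the structure below is purely illustrative (values taken from Table 1), not the dataset's actual on-disk format.

```python
# Illustrative sample structure (values from Table 1; not the real file format).
sample = {
    "infobox": [
        ("name", "arthur ignatius conan doyle"),
        ("born", "22 may 1859 edinburgh , scotland"),
        ("occupation", "author , writer , physician"),
        ("nationality", "british"),
    ],
    # Generation target: the first sentence of the biography.
    "target": ("sir arthur ignatius conan doyle ( 22 may 1859 - 7 july 1930 ) "
               "was a british writer best known for his detective fiction "
               "featuring the character sherlock holmes ."),
}
```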
Group             Model                         BLEU   ROUGE  NIST
Previous results  KN                             2.21   0.38  0.93
                  Template KN                   19.80  10.70  5.19
                  Table NLM†                    34.70  25.80  7.98
Our results       Content attention only        41.38  34.65  8.57
                  Order planning (full model)   43.91  37.15  8.85

Table 2: Comparison of the overall performance between our model and previous methods. †Best results in Lebret, Grangier, and Auli (2016).
We applied the standard data split: 80% for training and 10% for testing, except that model selection was performed on a validation subset of 1000 samples (based on BLEU-4).

Settings

We decapitalized all words and kept a vocabulary size of 20,000 for content words and generation candidates, which also followed previous work (Lebret, Grangier, and Auli 2016). Even with this reasonably large vocabulary size, we had more than 900k out-of-vocabulary words. This rationalizes the use of the copy mechanism.

For the names of table fields, we treated each as a special token. By removing nonsensical fields whose content is "none" and grouping fields occurring less than 100 times as an "Unknown" field, we had 1475 different field names in total.

In our experiments, both words' and table fields' embeddings were 400-dimensional and LSTM layers were 500-dimensional. Notice that a field (e.g., "name") and a content/generation word (e.g., also "name"), even with the same string, were considered as different tokens; hence, they had different embeddings. We randomly initialized all embeddings, which are tuned during training.

We used Adam (Kingma and Ba 2015) as the optimization algorithm with a batch size of 32; other hyperparameters were set to default values.

Baselines

We compared our model with previous results using either traditional language models or neural networks.

• KN and Template KN (Heafield et al. 2013): Lebret, Grangier, and Auli (2016) train an interpolated Kneser-Ney (KN) language model for comparison with the KenLM toolkit. They also train a KN language model with templates.
• Table NLM: Lebret, Grangier, and Auli (2016) propose an RNN-based model with attention and copy mechanisms. They have several model variants, and we quote the highest reported results.

We report model performance in terms of several metrics, namely BLEU-4, ROUGE-4, and NIST-4, which are computed by standard software: NIST mteval-v13a.pl (for BLEU and NIST) and MSR rouge-1.5.5 (for ROUGE). We did not include the perplexity measure in Lebret, Grangier, and Auli (2016) because the copy mechanism makes the vocabulary size vary among data samples, and thus the perplexity is not comparable among different approaches.

Results

Overall Performance. Table 2 compares the overall performance with previous work. We see that modern neural networks are considerably better than traditional KN models with or without templates. Moreover, our base model (with content attention only) outperforms Lebret, Grangier, and Auli (2016), showing our better engineering efforts. After adding up all proposed components, we obtain a +2.5 improvement in BLEU and ROUGE and a +0.3 improvement in NIST, achieving new state-of-the-art results.

Ablation Test. Table 3 provides an extensive ablation test to verify the effectiveness of each component in our model. The top half of the table shows the results without the copy mechanism, and the bottom half incorporates the copying score as described previously. We observe that the copy mechanism is consistently effective with different types of attention.

Component          BLEU   ROUGE  NIST
Content att.       41.38  34.65  8.57
Link att.          38.24  32.75  8.36
Hybrid att.        43.01  36.91  8.75
Copy+Content att.  41.89  34.93  8.63
Copy+Link att.     39.08  33.47  8.42
Copy+Hybrid att.   43.91  37.15  8.85

Table 3: Ablation test.

We then compare content-based attention and link-based attention, as well as their hybrid (also Table 3). The results show that link-based attention alone is not as effective as content-based attention. However, we achieve better performance if we combine them with an adaptive gate, i.e., the proposed hybrid attention. The results are consistent in both halves of Table 3 (with or without copying) and in terms of all metrics (BLEU, ROUGE, and NIST). This implies that content-based attention and link-based attention do capture different aspects of information, and their hybrid is more suited to the task of table-to-text generation.
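To make the hybrid explicit, the sketch below interpolates the two attention distributions with a gate z; the exact parameterization of the self-adaptive gate in Equation 11 is not reproduced in this excerpt, so the sigmoid form here is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_attention(alpha_content, alpha_link, z):
    # Interpolation of the two attention distributions: z = 1 recovers pure
    # content-based attention, z = 0 pure link-based attention.
    return z * alpha_content + (1.0 - z) * alpha_link

def adaptive_gate(decoder_state, table_summary, w_z, b_z):
    # Self-adaptive gate (assumed form): a scalar in (0, 1) computed from the
    # current decoding state and table information, so the balance between the
    # two attention types can change at every time step.
    return sigmoid(w_z @ np.concatenate([decoder_state, table_summary]) + b_z)
```

Fixing z to a constant turns this into the plain interpolation baseline examined next.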
Effect of the Gate. We are further interested in the effect of the gate z, which balances content-based attention αcontent and link-based attention αlink. As defined in Equation 11, the computation of z depends on the decoding state as well as table information; hence it is "self-adaptive." We would like to verify whether such adaptiveness is useful. To this end, we designed a controlled experiment where the gate z was manually assigned in advance and fixed during training. In other words, the setting was essentially a (fixed) interpolation between αcontent and αlink. Specifically, we tuned z from 0 to 1 with a granularity of 0.1, and plot BLEU scores as the comparison metric in Figure 3.

[Figure 3: Comparing the self-adaptive gate with interpolation of content- and link-based attention. z = 0 is link-based attention, z = 1 is content-based attention. (Curves: BLEU against a fixed z, and BLEU with the self-adaptive z.)]

As seen, interpolation of content- and link-based attention is generally better than either single mechanism, which again shows the effectiveness of hybrid attention. However, the peak performance of simple interpolation (42.89 BLEU when z = 0.4) is worse than that of the self-adaptive gate, implying that our gating mechanism can automatically adjust the importance of αcontent and αlink at a particular time step based on the current state and input.

Different Ways of Using Field Information. We are curious whether the proposed order-planning mechanism is better than other possible ways of using field information. We conducted two controlled experiments as follows. Similar to the proposed approach, we multiplied the attention probability by a field matrix and thus obtained a weighted field embedding. We fed it to either (1) the computation of content-based attention, i.e., Equations 5–6, or (2) the RNN decoder's input, i.e., Equation 13. In both cases, the last step's weighted field embedding was concatenated with the embedding of the generated word y_{t-1}.

Feeding field info to...   BLEU   ROUGE  NIST
None                       41.89  34.93  8.63
Computation of αcontent    40.52  34.95  8.57
Decoder RNN's input        41.96  35.07  8.61
Hybrid att. (proposed)     43.91  37.15  8.85

Table 4: Comparing different possible ways of using field information. "None": no field information is fed back to the network, i.e., content-based attention computed by Equation 7 (with copying).

From Table 4, we see that feeding field information to the computation of αcontent interferes with content attention and leads to performance degradation, and that feeding it to the decoder RNN slightly improves model performance. However, both controlled experiments are worse than the proposed method. The results confirm that our order-planning mechanism is indeed useful in modeling the order of fields, outperforming several other approaches that use the same field information in a naïve fashion.

Case Study and Visualization

We showcase an example in Table 5. With only content-based attention, the network is confused about when the word American is appropriate in the sentence, and corrupts the phrase former governor of the federal reserve system as it appears in the reference. However, when link-based attention is added, the network is more aware of the order between the fields "Nationality" and "Occupation," and generates the nationality American before the occupation economist.

This process can also be visualized in Figure 4. Here, we plot our model's content-based attention, link-based attention, and their hybrid. (The content- and link-based attention probabilities may be different from those separately trained in the ablation test.) After generating "emmett john rice ( december 21, 1919 – march 10, 2011 ) was," content-based attention skips the nationality and focuses more on the occupation. Link-based attention, on the other hand, provides a strong clue suggesting to generate the nationality first and then the occupation. In this way, the obtained sentence is more compliant with conventions.

Related Work

Text generation has long aroused interest in the NLP community due to its wide applications, including automated navigation (Dale, Geldof, and Prost 2003) and weather forecasting (Reiter et al. 2005). Traditionally, text generation can be divided into several steps (Stent, Prassad, and Walker 2004): (1) content planning defines what information should be conveyed in the generated sentence; (2) sentence planning determines what to generate in each sentence; and (3) surface realization actually generates those sentences with words.

In early years, surface realization was often accomplished by templates (Van Deemter, Theune, and Krahmer 2005) or statistically learned (shallow) models, e.g., probabilistic context-free grammars (Belz 2008) and language models (Angeli, Liang, and Klein 2010), with hand-crafted features or rules. Therefore, these methods are weak in terms of the quality of generated sentences. For planning, researchers also apply (shallow) machine learning approaches. Barzilay and Lapata (2005), for example, model it as a collective classification problem, whereas Liang, Jordan, and Klein (2009) use a generative semi-Markov model to align text segments and assigned meanings. Generally, planning and realization in the above work are separate and have difficulty in capturing the complexity of language due to the nature of shallow models.

Recently, the recurrent neural network (RNN) is playing a key role in natural language generation. As RNN can au-
Wikipedia infobox:
Name: Emmett John Rice
Birth date: December 21, 1919
Birth place: Florence, South Carolina, United States
Death date: March 10, 2011 (aged 91)
Death place: Camas, Washington, United States
Nationality: American
Occupation: Governor of the Federal Reserve System, Economics Professor
Known for: Expert in the Monetary System of Developing Countries, Father to Susan E. Rice

Reference: emmett john rice ( december 21 , 1919 – march 10 , 2011 ) was a former governor of the federal reserve system , a Cornell university economics professor , expert in the monetary systems of developing countries and the father of the current national security advisor to president barack obama , susan e . rice .

Content-based attention: emmett john rice ( december 21 , 1919 – march 10 , 2011 ) was an economist , author , public official and the former american governor of the federal reserve system , the first african american UNK .

Hybrid attention: emmett john rice ( december 21 , 1919 – march 10 , 2011 ) was an american economist , author , public official and the former governor of the federal reserve system , expert in the monetary systems of developing countries .

Table 5: Case study. Left: Wikipedia infobox. Right: A reference and two generated sentences by different attention (both with the copy mechanism).
[Figure 4: Visualization of (a) content-based attention αcontent, (b) link-based attention αlink, and (c) the hybrid attention αhybrid over table contents (e.g., nationality:american, occupation:governor, ..., known for:expert) while generating ") was an american economist".]

Our paper proposes an order-planning approach by designing a hybrid of content- and link-based attention. The model is inspired by hybrid content- and location-based addressing in the Differentiable Neural Computer (Graves et al. 2016, DNC), where the location-based addressing is defined heuristically. Instead, we propose a transition-like link matrix that models how likely a field is mentioned after another, which is more suited to our scenario.

Moreover, our entire model is differentiable, and thus the planning and realization steps in traditional language generation can be learned end-to-end in our model.
Conclusion and Future Work