ated with a weight between them. The sum of the weights of a sentence is its score. This article analyzes 15 sentence scoring methods, and some variations of them, widely used and referenced in the literature on document summarization in the last 10 years. The scoring methods comprise the feature vector that is used to train the classifier and to rank sentences, totaling 20 features. The key point of this paper is to use machine learning techniques to analyze such features so as to point out which of them better contribute to yielding good quality summaries.

Quantitative and qualitative strategies are used here as ways of assessing the quality of summaries. The quantitative assessment was performed using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [9], a measure widely accepted for such a purpose. In addition, another quantitative analysis was performed by three people who analyzed each original text and generated summaries following a methodology that is described below. The qualitative assessment is made by counting the number of sentences selected by the system that coincide with the sentences selected by all three human users. The results obtained show the effectiveness of the proposed method: it selects twice as many relevant sentences to compose the summary, and it achieves results 71% better in the evaluation using the ROUGE 2 metric.

2. THE CNN CORPUS

The CNN corpus developed by Lins and his colleagues [9] consists of news texts extracted from the CNN website (www.cnn.com). The main advantage of this test corpus rests not only on the high quality of the writing, which uses grammatically correct standard English to report on general-interest subjects, but also on the fact that each news text is provided with its highlights, a summary of three to five sentences written by the original author(s). The highlights were the basis for the development of the gold standard, which was obtained by the injective mapping of each sentence in the highlights onto the original sentences of the text. Such a mapping was performed by three different people, and the gold standard was formed with the most voted mapped sentences. A very high degree of consistency in sentence selection was observed. The CNN corpus is possibly the largest existing corpus for benchmarking extractive summarization techniques. The current version has 400 documents, written in English, totaling 13,228 sentences, of which 1,471 were selected for the gold standards, representing an average compression rate of 90%.

3. THE SYSTEM

The steps of the methodology for obtaining the extractive summaries are presented in the following sections.

3.1 Text pre-processing

The news articles obtained from the CNN website must be carefully chosen so as to contain only text; news articles with figures, videos, tables and other multimedia elements are discarded. Besides that, an article must be "complete", with text, highlights, title, author(s), subject area, etc. All such data is inserted into an XML file.

The text part of the document is then processed for paragraph segmentation, sentence segmentation, stop-word removal and stemming. Each text paragraph is numbered, as well as each of its sentences. Sentence segmentation is performed by Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml). Stop words [5] are removed since they are considered unimportant and can indicate noise; they are predefined and stored in an array that is compared against the words in the document. Word stemming [13] converts each word into its root form by removing its prefix and suffix. After this stage the text is structured in XML and included in the XML file that corresponds to the news article. As the focus here is on the text part of the document, all other XML-file attributes are not addressed further in this paper.

3.2 Feature Extraction

After preprocessing, the document is represented by the set D = {S1, S2, ..., Sn}, where Si is a sentence in the document D. The preprocessed sentences are subjected to the feature extraction process, so that a feature vector Vi = {F1, F2, ..., F20} is generated for each sentence Si. As already mentioned, extractive summarization uses three scoring strategies [4]: (i) Word: assigns scores to the most important words; (ii) Sentence: accounts for features of the sentence itself, such as its position in the document, its similarity to the title, etc.; (iii) Graph: uses the relationships between words and sentences.

Table 1 shows the features analyzed in this work and their type of scoring. They correspond to the most widely acknowledged techniques for extractive summarization reported in the literature.

Table 1: Features used for extractive summarization and their type of scoring.

  Feature  Name of Extractive Summarization Strategy  Type of Scoring
  F01      Aggregate Similarity                       Graph
  F02      Bushy Path                                 Graph
  F03      Centrality                                 Sentence
  F04      Heterogeneous Graph                        Graph
  F05      Text Rank                                  Graph
  F06      Cue-Phrase                                 Sentence
  F07      Numerical Data                             Sentence
  F08      Position Paragraph                         Sentence
  F09      Position Text                              Sentence
  F10      Resemblance Title                          Sentence
  F11      Sentence Length                            Sentence
  F12      Sentence Position in Paragraph             Sentence
  F13      Sentence Position in Text                  Sentence
  F14      Proper-Noun                                Word
  F15      Co-Occurrence Bleu                         Word
  F16      Lexical Similarity                         Word
  F17      Co-Occurrence N-gram                       Word
  F18      TF/IDF                                     Word
  F19      Upper Case                                 Word
  F20      Word Frequency                             Word
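As an illustration of the Word-level scoring strategy, the TF/IDF feature (F18) can be sketched as follows. This is a minimal sketch, not the implementation used in the experiments: tokenization is a naive lowercase whitespace split, and each sentence is treated as a "document" for the IDF statistics.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of TF-IDF weights of its words.

    Each sentence is treated as a document for the IDF statistics;
    tokenization is a naive lowercase whitespace split.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # document frequency: number of sentences containing each word
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # sum of (term frequency) * (inverse document frequency) per word
        score = sum((tf[w] / len(toks)) * math.log(n / df[w]) for w in tf)
        scores.append(score)
    return scores

sents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "Stock markets fell sharply today",
]
scores = tfidf_sentence_scores(sents)
# the sentence made of words that appear nowhere else scores highest
```

Sentences built from rare words receive the highest scores, which is why common function words are removed beforehand in the pre-processing stage.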
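A Sentence-level feature such as Resemblance to Title (F10) can likewise be sketched with a simple word-overlap measure. The Jaccard formula and the tiny stop-word list below are illustrative assumptions, since the exact formula is not given in this excerpt; any predefined stop-word list could be plugged in.

```python
def title_resemblance(sentence, title,
                      stop_words=frozenset({"the", "a", "an", "of", "in", "on", "to"})):
    """Jaccard overlap between the content words of a sentence and the title.

    The overlap formula and the small stop-word set are illustrative
    assumptions, not the exact feature definition used by the authors.
    """
    s = {w for w in sentence.lower().split() if w not in stop_words}
    t = {w for w in title.lower().split() if w not in stop_words}
    if not s or not t:
        return 0.0
    return len(s & t) / len(s | t)

score = title_resemblance("Markets fell sharply in early trading",
                          "Stock markets fell")
# shares {markets, fell} out of 6 distinct content words
```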
3.3 Classification model

The steps for creating the classification model used to select the sentences that compose the summary are detailed here.

The first step has the purpose of reducing the problems inherent to the feature extraction of each sentence. First, the feature vectors that have missing information or outliers (when all features reach the maximum value) are eliminated. Another problem addressed here is the unbalance of the basis: whenever there is a large disparity in the number of examples of the training classes, the problem known in the literature as class imbalance arises. Classification models that are optimized with respect to overall accuracy then tend to become trivial models that almost always predict the majority class.

The algorithm chosen to address the balancing problem was SMOTE [3]. Its principle is to create artificial data based on the spatial relations between examples of the minority class. Specifically, for each instance of the minority class, its k nearest neighbors within the class are considered, for some integer k; depending on the amount of oversampling, some of these neighbors are randomly chosen. A synthetic sample is then generated as follows: calculate the difference between the feature vector (sample) under consideration and its chosen neighbor, multiply this difference by a random number between zero and one, and add the result to the vector under consideration. This selects a point along the line segment between the two points, effectively making the decision region of the minority class more general [3].

Then the system performs feature selection, an important tool for reducing the dimensionality of the vectors, as some features contribute to decreasing the efficiency of the classifier. Another contribution of this study is to identify which of the 20 features most used in extractive summarization in the last 10 years contribute effectively to a good performance of the classifiers. The experiment was conducted on the corpus of 400 CNN news texts in English.

The experiments were performed with the attribute selection algorithms of WEKA (http://www.cs.waikato.ac.nz/ml/weka/); three were chosen and applied on the balanced basis to define the best attributes of the vector: (i) CFS Subset Evaluator: evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them; (ii) Information Gain Evaluator: evaluates the worth of an attribute by measuring the information gain with respect to the class; (iii) SVM Attribute: evaluates the worth of an attribute by using an SVM classifier. The top five features indicated by the selection methods were chosen; Figure 1 shows the profile of the selected features.

Figure 1: Selected Features

The selected features demonstrate the prevalence of language-independent features such as the position in the text, TF/IDF and similarity. This allows summarizing texts in different languages.

Six classifiers were tested using the WEKA platform: Naive Bayes [8], MLP [7], SVM [7], KNN [1], Ada Boost [6], and Random Forest [2]. The results of the classifiers were compared with seven summarization systems: Open Text Summarizer (OTS, libots.sourceforge.net), Text Compactor (TC, www.textcompactor.com), Free Summarizer (FS, freesummarizer.com), Smmry (SUMM, smmry.com), Web Summarizer (WEB, www.websummarizer.com), Intellexer Summarizer (INT, summarizer.intellexer.com), and Compendium (COMP) [10].

Figure 2 presents the proposed summarization method, showing the number of correct sentences chosen from the human-selected sentences that form the gold standard. This experiment used 400 texts from CNN news.

Figure 2: Evaluation of the classifiers for summarization

The classifiers were tested with variations of parameters, with and without adjustment and balancing of the basis. The technique chosen to validate the models was cross-validation. The tests performed with the unbalanced basis yielded an accuracy of 52%, while those with the balanced basis yielded 70%. The Naive Bayes classifier achieved the best result in all cases. In the qualitative evaluation it reached 969 and 1,082 correctly selected sentences in the unbalanced and balanced cases, respectively. In the unbalanced case Naive Bayes outperformed the second place (Ada Boost) by 7.42%, and in the balanced case it selected the same number of important sentences as KNN.
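The SMOTE interpolation step described above can be sketched as follows. This is a simplified illustration of the idea, not the WEKA implementation used in the experiments: it generates a single synthetic sample, with a brute-force nearest-neighbor search over small tuples of floats.

```python
import random

def smote_sample(minority, k=2, rng=random.Random(0)):
    """Generate one synthetic minority-class sample via SMOTE interpolation.

    Picks a random minority instance, finds its k nearest neighbors within
    the class (brute force, Euclidean ranking), chooses one neighbor at
    random, and returns a point on the segment between instance and neighbor.
    """
    x = rng.choice(minority)
    # squared Euclidean distance suffices for ranking neighbors
    neighbors = sorted(
        (v for v in minority if v is not x),
        key=lambda v: sum((a - b) ** 2 for a, b in zip(x, v)),
    )[:k]
    nn = rng.choice(neighbors)
    gap = rng.random()  # random factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, nn))

minority = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]
synthetic = smote_sample(minority)
# the synthetic point lies between two existing minority examples
```

Because each synthetic point lies on a segment between two real minority examples, oversampling broadens the minority region instead of merely duplicating instances.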
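The ROUGE measure used in the quantitative assessment [9] can be illustrated with a minimal ROUGE-2 recall computation. This bare-bones sketch assumes lowercase whitespace tokenization and a single reference summary; the full ROUGE package adds stemming options, multiple references, and further statistics.

```python
from collections import Counter

def rouge_2_recall(candidate, reference):
    """ROUGE-2 recall: fraction of reference bigrams found in the candidate,
    with clipped counts; assumes lowercase whitespace tokenization."""
    def bigrams(text):
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / sum(ref.values())

r = rouge_2_recall("the cat sat on the mat",
                   "the cat lay on the mat")
# 3 of the 5 reference bigrams also occur in the candidate -> 0.6
```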
Figures 3 and 4 present the comparison of the Naive Bayes classifier results against the seven summarization systems. The superiority of the proposed method was shown in both evaluations. In the qualitative assessment the proposed method reached 1,082 correctly selected sentences, an improvement of more than 100% in relation to Text Compactor, the best tool found in the literature; in absolute numbers it selected 554 more correct sentences. Using ROUGE, the Naive Bayes classifier achieved a result 61.3% better than Web Summarizer, the second place: the proposed method reached 71% of accuracy while WEB obtained 44%. These results confirm the hypothesis that using machine learning techniques improves text summarization results.

Figure 3: Evaluation of the summarization systems

Figure 4: Precision of the Summarization Systems using ROUGE 2

4. CONCLUSIONS AND LINES FOR FURTHER WORK

Automatic summarization opens a wide number of possibilities, such as the efficient classification, retrieval and information-based compression of text documents. This paper presents an assessment of the most widely used sentence scoring methods for text summarization. The results demonstrate that a judicious choice of the set of automatic sentence scoring methods provides better quality summaries and also greater processing efficiency. The proposed system selects 554 more relevant sentences for the summaries, which means an improvement of more than 100% in relation to the best tool found in the literature. It was also evident that balancing the basis of examples yields gains in the performance of the sentence selection system.

The next step is the validation of the experiments on other summarization test corpora, with texts other than news articles. Although the CNN corpus may possibly be the largest and best test corpus for assessing news articles today, the authors of this paper are promoting an effort to double its size in the near future, allowing even better testing capabilities.

5. ACKNOWLEDGMENTS

The research results reported in this paper have been partly funded by a R&D project between Hewlett-Packard do Brazil and UFPE originated from tax exemption (IPI - Law number 8.248 of 1991 and later updates).

6. REFERENCES

[1] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Mach. Learn., 6(1):37-66, Jan. 1991.
[2] L. Breiman. Random forests. Mach. Learn., 45(1):5-32, Oct. 2001.
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321-357, June 2002.
[4] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro. Assessing sentence scoring techniques for extractive text summarization. Expert Systems with Applications, 40(14):5755-5764, 2013.
[5] W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992.
[6] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148-156, 1996.
[7] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.
[8] G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pages 338-345, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[9] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens and S. Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74-81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[10] E. Lloret and M. Palomar. COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres. Natural Language Engineering, FirstView:1-40, 2012.
[11] E. Lloret and M. Palomar. Text summarisation in progress: a literature review. Artif. Intell. Rev., 37(1):1-41, Jan. 2012.
[12] A. Patel, T. Siddiqui, and U. S. Tiwary. A language independent approach to multilingual text summarization. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), RIAO '07, pages 123-132, Paris, France, 2007.
[13] C. Silva and B. Ribeiro. The importance of stop word removal on recall values in text categorization. In IJCNN 2003, volume 3, 2003.