ated with a weight between them. The sum of the weights of a sentence is its score. This article analyzes 15 sentence scoring methods, and some variations of them, widely used and referenced in the literature on document summarization in the last 10 years. The scoring methods comprise the feature vector that is used to train the classifier and to rank sentences, totaling 20 features. The key point of this paper is to use machine learning techniques to analyze such features so as to point out which of them better contribute to yielding good quality summaries.

Quantitative and qualitative strategies are used here as ways of assessing the quality of summaries. The quantitative assessment was performed using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [9], a measure widely accepted for such a purpose. In addition, another quantitative analysis was performed by three people who analyzed each original text and generated summaries following a methodology that is described below. The qualitative assessment is made by counting the number of sentences selected by the system that coincide with the sentences selected by all three human users. The results obtained show the effectiveness of the proposed method: it selects twice as many relevant sentences to compose the summary, and it achieves results 71% better in the evaluation using the ROUGE 2 metric.

2. THE CNN CORPUS

The CNN corpus developed by Lins and his colleagues [9] consists of news texts extracted from the CNN website (www.cnn.com). The main advantage of this test corpus rests not only on the high quality of the writing, which uses grammatically correct standard English to report on general-interest subjects, but also on the fact that each news text is provided with its highlights, a summary of three to five sentences written by the original author(s). The highlights were the basis for the development of the gold standard, which was obtained by the injective mapping of each sentence in the highlights onto the original sentences of the text. Such a mapping was performed by three different people, and the gold standard was formed with the most voted mapped sentences. A very high degree of consistency in sentence selection was observed. The CNN corpus is possibly the largest existing corpus for benchmarking extractive summarization techniques. The current version has 400 documents, written in English, totaling 13,228 sentences, of which 1,471 were selected for the gold standards, representing an average compression rate of 90%.

3. THE SYSTEM

The steps of the methodology for obtaining the extractive summaries are presented in the following sections.

3.1 Text pre-processing

The news articles obtained from the CNN website must be carefully chosen so as to contain only text; news articles with figures, videos, tables and other multimedia elements are discarded. Besides that, an article must be "complete", with text, highlights, title, author(s), subject area, etc. All such data is inserted into an XML file.

The text part of the document is then processed for paragraph segmentation, sentence segmentation, stop-word removal and stemming. Each text paragraph is numbered, as well as each of its sentences. Sentence segmentation is performed by Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml). Stop words [5] are removed since they are considered unimportant and can indicate noise; they are predefined and stored in an array that is compared against the words in the document. Word stemming [13] converts each word into its root form by removing its prefix and suffix. After this stage the text is structured in XML and included in the XML file that corresponds to the news article. As the focus here is on the text part of the document, all other XML-file attributes are not addressed further in this paper.

3.2 Feature Extraction

After preprocessing, the document is represented by the set D = {S1, S2, ..., Sn}, where Si is a sentence in the document D. The preprocessed sentences are subjected to the feature extraction process, so that a feature vector Vi = {F1, F2, ..., F20} is generated for each sentence Si. As already mentioned, extractive summarization uses three scoring strategies [4]: (i) Word: assigns scores to the most important words; (ii) Sentence: accounts for features of the sentence itself, such as its position in the document, its similarity to the title, etc.; (iii) Graph: uses the relationships between words and sentences.

Table 1 shows the features analyzed in this work and their type of scoring. They correspond to the most widely acknowledged techniques for extractive summarization reported in the literature.

Table 1: Features used for extractive summarization and their type of scoring.

  Feature  Name of Extractive Summarization Strategy  Type of Scoring
  F01      Aggregate Similarity                       Graph
  F02      Bushy Path                                 Graph
  F03      Centrality                                 Sentence
  F04      Heterogeneous Graph                        Graph
  F05      Text Rank                                  Graph
  F06      Cue-Phrase                                 Sentence
  F07      Numerical Data                             Sentence
  F08      Position Paragraph                         Sentence
  F09      Position Text                              Sentence
  F10      Resemblance Title                          Sentence
  F11      Sentence Length                            Sentence
  F12      Sentence Position in Paragraph             Sentence
  F13      Sentence Position in Text                  Sentence
  F14      Proper-Noun                                Word
  F15      Co-Occurrence Bleu                         Word
  F16      Lexical Similarity                         Word
  F17      Co-Occurrence N-gram                       Word
  F18      TF/IDF                                     Word
  F19      Upper Case                                 Word
  F20      Word Frequency                             Word
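As an illustration of the Word-level scoring strategy, the TF/IDF feature (F18) can be sketched as follows. This is a minimal sketch, not the implementation used in the experiments: tokenization is a naive lowercase whitespace split, and each sentence is treated as a "document" for the IDF statistics.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of TF-IDF weights of its words.

    Each sentence is treated as a document for the IDF statistics;
    tokenization is a naive lowercase whitespace split.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # document frequency: number of sentences containing each word
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # sum of (term frequency) * (inverse document frequency) per word
        score = sum((tf[w] / len(toks)) * math.log(n / df[w]) for w in tf)
        scores.append(score)
    return scores

sents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "Stock markets fell sharply today",
]
scores = tfidf_sentence_scores(sents)
# the sentence made of words that appear nowhere else scores highest
```

Sentences built from rare words receive the highest scores, which is why common function words are removed beforehand in the pre-processing stage.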
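A Sentence-level feature such as Resemblance to Title (F10) can likewise be sketched with a simple word-overlap measure. The Jaccard formula and the tiny stop-word list below are illustrative assumptions, since the exact formula is not given in this excerpt; any predefined stop-word list could be plugged in.

```python
def title_resemblance(sentence, title,
                      stop_words=frozenset({"the", "a", "an", "of", "in", "on", "to"})):
    """Jaccard overlap between the content words of a sentence and the title.

    The overlap formula and the small stop-word set are illustrative
    assumptions, not the exact feature definition used by the authors.
    """
    s = {w for w in sentence.lower().split() if w not in stop_words}
    t = {w for w in title.lower().split() if w not in stop_words}
    if not s or not t:
        return 0.0
    return len(s & t) / len(s | t)

score = title_resemblance("Markets fell sharply in early trading",
                          "Stock markets fell")
# shares {markets, fell} out of 6 distinct content words
```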
3.3 Classification model

The steps for creating the classification model used to select the sentences that compose the summary are detailed here.

The first step has the purpose of reducing the problems inherent to the feature extraction of each sentence. First, the feature vectors that have missing information or outliers (when all features reach the maximum value) are eliminated. Another problem addressed here is the unbalance of the basis: whenever there is a large disparity in the number of examples of the training classes, the problem known in the literature as class imbalance arises. Classification models that are optimized with respect to overall accuracy then tend to become trivial models that almost always predict the majority class.

The algorithm chosen to address the balancing problem was SMOTE [3]. Its principle is to create artificial data based on the spatial relations between examples of the minority class. Specifically, for each instance of the minority class, its k nearest neighbors within the class are considered, for some integer k; depending on the amount of oversampling, some of these neighbors are randomly chosen. A synthetic sample is then generated as follows: calculate the difference between the feature vector (sample) under consideration and its chosen neighbor, multiply this difference by a random number between zero and one, and add the result to the vector under consideration. This selects a point along the line segment between the two points, effectively making the decision region of the minority class more general [3].

Then the system performs feature selection, an important tool for reducing the dimensionality of the vectors, as some features contribute to decreasing the efficiency of the classifier. Another contribution of this study is to identify which of the 20 features most used in extractive summarization in the last 10 years contribute effectively to a good performance of the classifiers. The experiment was conducted on the corpus of 400 CNN news texts in English.

The experiments were performed with the attribute selection algorithms of WEKA (http://www.cs.waikato.ac.nz/ml/weka/); three were chosen and applied on the balanced basis to define the best attributes of the vector: (i) CFS Subset Evaluator: evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them; (ii) Information Gain Evaluator: evaluates the worth of an attribute by measuring the information gain with respect to the class; (iii) SVM Attribute: evaluates the worth of an attribute by using an SVM classifier. The top five features indicated by the selection methods were chosen; Figure 1 shows the profile of the selected features.

Figure 1: Selected Features

The selected features demonstrate the prevalence of language-independent features such as the position in the text, TF/IDF and similarity. This allows summarizing texts in different languages.

Six classifiers were tested using the WEKA platform: Naive Bayes [8], MLP [7], SVM [7], KNN [1], Ada Boost [6], and Random Forest [2]. The results of the classifiers were compared with seven summarization systems: Open Text Summarizer (OTS, libots.sourceforge.net), Text Compactor (TC, www.textcompactor.com), Free Summarizer (FS, freesummarizer.com), Smmry (SUMM, smmry.com), Web Summarizer (WEB, www.websummarizer.com), Intellexer Summarizer (INT, summarizer.intellexer.com), and Compendium (COMP) [10].

Figure 2 presents the proposed summarization method, showing the number of correct sentences chosen from the human-selected sentences that form the gold standard. This experiment used 400 texts from CNN news.

Figure 2: Evaluation of the classifiers for summarization

The classifiers were tested with variations of parameters, with and without adjustment and balancing of the basis. The technique chosen to validate the models was cross-validation. The tests performed with the unbalanced basis yielded an accuracy of 52%, while those with the balanced basis yielded 70%. The Naive Bayes classifier achieved the best result in all cases. In the qualitative evaluation it reached 969 and 1,082 correctly selected sentences in the unbalanced and balanced cases, respectively. In the unbalanced case Naive Bayes outperformed the second place (Ada Boost) by 7.42%, and in the balanced case it selected the same number of important sentences as KNN.
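The SMOTE interpolation step described above can be sketched as follows. This is a simplified illustration of the idea, not the WEKA implementation used in the experiments: it generates a single synthetic sample, with a brute-force nearest-neighbor search over small tuples of floats.

```python
import random

def smote_sample(minority, k=2, rng=random.Random(0)):
    """Generate one synthetic minority-class sample via SMOTE interpolation.

    Picks a random minority instance, finds its k nearest neighbors within
    the class (brute force, Euclidean ranking), chooses one neighbor at
    random, and returns a point on the segment between instance and neighbor.
    """
    x = rng.choice(minority)
    # squared Euclidean distance suffices for ranking neighbors
    neighbors = sorted(
        (v for v in minority if v is not x),
        key=lambda v: sum((a - b) ** 2 for a, b in zip(x, v)),
    )[:k]
    nn = rng.choice(neighbors)
    gap = rng.random()  # random factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, nn))

minority = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]
synthetic = smote_sample(minority)
# the synthetic point lies between two existing minority examples
```

Because each synthetic point lies on a segment between two real minority examples, oversampling broadens the minority region instead of merely duplicating instances.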
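The ROUGE measure used in the quantitative assessment [9] can be illustrated with a minimal ROUGE-2 recall computation. This bare-bones sketch assumes lowercase whitespace tokenization and a single reference summary; the full ROUGE package adds stemming options, multiple references, and further statistics.

```python
from collections import Counter

def rouge_2_recall(candidate, reference):
    """ROUGE-2 recall: fraction of reference bigrams found in the candidate,
    with clipped counts; assumes lowercase whitespace tokenization."""
    def bigrams(text):
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / sum(ref.values())

r = rouge_2_recall("the cat sat on the mat",
                   "the cat lay on the mat")
# 3 of the 5 reference bigrams also occur in the candidate -> 0.6
```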
Figures 3 and 4 present the comparison of the Naive Bayes classifier results against the seven summarization systems. The superiority of the proposed method was shown in both evaluations. In the qualitative assessment the proposed method reached 1,082 correctly selected sentences, an improvement of more than 100% in relation to Text Compactor, the best tool found in the literature; in absolute numbers it selected 554 more correct sentences. Using ROUGE, the Naive Bayes classifier achieved a result 61.3% better than Web Summarizer, the second place: the proposed method reached 71% of accuracy while WEB obtained 44%. These results confirm the hypothesis that using machine learning techniques improves text summarization results.

Figure 3: Evaluation of the summarization systems

Figure 4: Precision of the Summarization Systems using ROUGE 2

4. CONCLUSIONS AND LINES FOR FURTHER WORK

Automatic summarization opens a wide number of possibilities, such as the efficient classification, retrieval and information-based compression of text documents. This paper presents an assessment of the most widely used sentence scoring methods for text summarization. The results demonstrate that a judicious choice of the set of automatic sentence scoring methods provides better quality summaries and also greater processing efficiency. The proposed system selects 554 more relevant sentences for the summaries, which means an improvement of more than 100% in relation to the best tool found in the literature. It was also evident that balancing the basis of examples yields gains in the performance of the sentence selection system.

The next step is the validation of the experiments on other summarization test corpora, with texts other than news articles. Although the CNN corpus may possibly be the largest and best test corpus for assessing news articles today, the authors of this paper are promoting an effort to double its size in the near future, allowing even better testing capabilities.

5. ACKNOWLEDGMENTS

The research results reported in this paper have been partly funded by a R&D project between Hewlett-Packard do Brazil and UFPE originated from tax exemption (IPI - Law number 8.248 of 1991 and later updates).

6. REFERENCES

[1] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Mach. Learn., 6(1):37-66, Jan. 1991.
[2] L. Breiman. Random forests. Mach. Learn., 45(1):5-32, Oct. 2001.
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321-357, June 2002.
[4] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro. Assessing sentence scoring techniques for extractive text summarization. Expert Systems with Applications, 40(14):5755-5764, 2013.
[5] W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992.
[6] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148-156, 1996.
[7] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.
[8] G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pages 338-345, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[9] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens and S. Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74-81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[10] E. Lloret and M. Palomar. COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres. Natural Language Engineering, FirstView:1-40, 2012.
[11] E. Lloret and M. Palomar. Text summarisation in progress: a literature review. Artif. Intell. Rev., 37(1):1-41, Jan. 2012.
[12] A. Patel, T. Siddiqui, and U. S. Tiwary. A language independent approach to multilingual text summarization. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), RIAO '07, pages 123-132, Paris, France, 2007.
[13] C. Silva and B. Ribeiro. The importance of stop word removal on recall values in text categorization. In IJCNN 2003, volume 3, 2003.