Automatic Text Summarization Using Gensim Word2Vec and K-Means Clustering Algorithm
Abstract—The significance of text summarization in the Natural Language Processing (NLP) community has expanded because of the staggering increase in virtual textual materials. A text summary is produced from one or multiple texts and conveys the important insights of the main text in a shorter form. Text summarization techniques help to pick out the indispensable points of the original texts, reducing the time and effort required to read the whole document. The problem has been approached from different points of view, in different domains, and using different concepts. Extractive and abstractive are the two main methods of summarizing text; an extractive summary is primarily concerned with what the summary content should be, usually based on the frequency of words, phrases, and sentences of the original document. This research proposes a sentence-based clustering algorithm (K-Means) for a single document. For feature extraction, we have used Gensim Word2Vec, which is intended to automatically extract semantic topics from documents as efficiently as possible.

Index Terms—Text summarization, Extractive, Single Document, NLP, Gensim, Word2Vec, K-Means.

I. INTRODUCTION

The subfield of text summarization has grown over the past half-century. D. R. Radev [1] describes a summary as a text generated from one or more texts that conveys the vital information of the actual document, is not more than half of the actual document, and is generally less than that. Redundant text exists in these daily generated documents, and the size of the documents is enlarging bit by bit.

It is convenient for people to summarize a document and bring out the implied meaning of that particular document, whereas a machine cannot solve the same problem as efficiently as expected. That is why various methods of summarization have been tested to bring out the best possible outcome. However, a universal strategy for summarization is not available yet.

The summarized document reflects the important aspects of the large text. Different text summarization techniques have been applied over time. An extractive approach to summarizing is to pick relevant sentences, paragraphs, etc. from the actual document and concatenate them into a simplified form. The importance of sentences is determined based on the numerical and linguistic characteristics of the sentences [2]. On the other hand, abstractive text summarization systems create new sentences, possibly rephrasing or using terms that are not in the original document [3].

The purpose of this research is to summarize a single document through a sentence-based model using clusters, with Gensim Word2Vec used for feature extraction; the sentence-based model then identifies the main ideas across all the sentences in the text.

The remainder of the report is organized as follows. Part II reviews the literature of the sector, Part III describes the proposed model, including the K-Means clustering model and Gensim Word2Vec, Part IV presents the result analysis and evaluation, and finally, in Part V, the conclusion is presented along with future ideas.

II. LITERATURE REVIEW

Rafael Ferreira et al. [4] addressed the method of extractive text summarization using various sentence scoring methods. Their proposed model is based on tokenizing the words and scoring them to identify the important parts of the document. They used three types of datasets, CNN, Blog Summarization, and SUMMAC, to test the algorithm.
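Since the proposed model feeds sentence-level features to a clustering step, one common way to obtain a fixed-length sentence vector from word vectors is to average them. The sketch below illustrates that averaging with a hypothetical three-dimensional toy vocabulary; `toy_vectors` is an invented stand-in for a trained Gensim Word2Vec model's `model.wv` lookup, and real trained vectors would typically have 100 or more dimensions. This is a simplified illustration, not the paper's actual code.

```python
# Sketch: building a fixed-length sentence feature vector by averaging
# word vectors. `toy_vectors` is a hypothetical stand-in for a trained
# Gensim Word2Vec model's `model.wv` lookup (real vectors: ~100 dims).
toy_vectors = {
    "market": [0.9, 0.1, 0.0],
    "shares": [0.8, 0.2, 0.1],
    "rose":   [0.7, 0.3, 0.2],
    "film":   [0.1, 0.9, 0.4],
}

def sentence_vector(tokens, vectors):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * len(next(iter(vectors.values())))
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

vec = sentence_vector(["market", "shares", "rose"], toy_vectors)
print([round(x, 3) for x in vec])  # [0.8, 0.2, 0.1]
```

With an actual Gensim model, the per-word lookup would be `model.wv[word]`; the averaging step itself is unchanged.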
E. K-Means Clustering

A cluster is a set of aggregated data points having certain similarities. K-Means is an iterative algorithm that divides the dataset into distinct clusters, keeping each data point in exactly one group. The aim of K-Means is to minimize the sum of squared distances between data points and their respective cluster centers.

Algorithmic representation of K-Means [15]: Let M = {m1, m2, m3, ..., mn} be the collection of data points and V = {v1, v2, ..., vc} be the cluster centers.

1) Select c cluster centers by random selection.
2) Compute the distance between each data point and every cluster center.
3) Allocate each data point to the cluster center at minimum distance among all cluster centers.
4) Recalculate each cluster center using

    v_i = (1 / c_i) * sum_{j=1}^{c_i} m_j        (1)

where c_i is the number of data points in the i-th cluster and the m_j are the points currently assigned to that cluster.

Fig. 1. The Elbow with k=2 in Business Doc 465

G. Summary Extraction

Finally, the cluster having the maximum number of sentences is selected, as the higher frequency of sentences in a cluster indicates the most valuable sentences of the text. All the sentences belonging to that cluster are scored based on their mean similarity under the Word2Vec model and the appearance of numbers and nouns: for each number and noun, 1 and 0.25, respectively, are added to the mean similarity of each sentence [16]. From that cluster of sentences, the model picks n sentences (n being one-third of the total number of sentences). The summary is generated by joining these n sentences, sorted by their order of appearance in the text.

IV. RESULT AND EVALUATION

There are multiple ways to compare two texts. One of them is the BLEU score, which has been used in this model [12]. BLEU has been chosen because it is easy to implement. It gives a result between 0 and 1, where 1 indicates the highest similarity and 0 the lowest. The generated summary and the original summary of each article have been compared using BLEU. The maximum score from ten iterations has been chosen as the BLEU score. Tables I, II, III, IV and V present the results of a few business, entertainment, politics, sports and tech article summaries.

TABLE III
BLEU SCORE OF POLITICS ARTICLES

Politics Doc Number   K values with Elbow   BLEU 1-gram   BLEU 2-gram   BLEU 3-gram   BLEU 4-gram   Cumulative BLEU
57                    3                     0.726         0.698         0.684         0.669         0.694
172                   3                     0.770         0.749         0.739         0.724         0.746
246                   4                     0.693         0.653         0.633         0.614         0.649
318                   2                     0.566         0.536         0.522         0.508         0.533
360                   4                     0.527         0.465         0.439         0.420         0.463
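The clustering step and the "keep the largest cluster" rule of the summary extraction stage can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the 2-D toy points stand in for Word2Vec sentence vectors, and for reproducibility the initial centers are the first k points rather than the random selection specified in step 1.

```python
def kmeans(points, k, iters=20):
    """Minimal K-Means following steps 1)-4) above. Initial centers are
    the first k points (step 1 specifies random selection; fixed here
    so the example is reproducible)."""
    centers = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # 2) squared distance from each point to every center,
        # 3) assign the point to the nearest center
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # 4) recompute each center as the mean of its assigned points
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters

# Toy 2-D "sentence vectors": three near the origin, two far away.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
centers, clusters = kmeans(pts, k=2)
# The summary extraction stage keeps the cluster with the most sentences:
largest = max(clusters, key=len)
print(len(largest))  # 3
```

The sentence-scoring step would then add 1 per numerical token and 0.25 per noun to each retained sentence's mean similarity before picking the top n sentences.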
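The BLEU comparison behind these tables can be illustrated with a minimal sketch: modified n-gram precision for each order, combined by a geometric mean into a cumulative score. The brevity penalty of full BLEU is omitted for simplicity, the example sentences are invented, and in practice a library implementation such as NLTK's `sentence_bleu` would normally be used instead.

```python
import math
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: clipped overlap of candidate n-grams
    with reference n-grams, divided by the candidate n-gram count."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def cumulative_bleu(candidate, reference, max_n=4):
    """Geometric mean of the 1..max_n gram precisions
    (brevity penalty omitted for simplicity)."""
    ps = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(ps) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in ps) / max_n)

gen = "the market rose sharply on monday".split()
ref = "the market rose sharply early on monday".split()
print(round(cumulative_bleu(gen, ref), 3))  # 0.604
```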
TABLE IV
BLEU SCORE OF SPORTS ARTICLES

Sports Doc Number   K values with Elbow   BLEU 1-gram   BLEU 2-gram   BLEU 3-gram   BLEU 4-gram   Cumulative BLEU
1                   4                     0.667         0.618         0.598         0.577         0.615
211                 2                     0.669         0.621         0.600         0.578         0.617
256                 3                     0.741         0.728         0.716         0.700         0.721
352                 2                     0.719         0.690         0.677         0.666         0.688
378                 2                     0.567         0.523         0.504         0.486         0.520

TABLE V
BLEU SCORE OF TECH ARTICLES

Tech Doc Number   K values with Elbow   BLEU 1-gram   BLEU 2-gram   BLEU 3-gram   BLEU 4-gram   Cumulative BLEU
74                3                     0.755         0.705         0.676         0.650         0.696
91                6                     0.481         0.443         0.425         0.406         0.439
152               3                     0.744         0.711         0.693         0.675         0.705
226               3                     0.670         0.596         0.560         0.532         0.589
297               3                     0.557         0.481         0.451         0.427         0.479

Fig. 2 shows the maximum, minimum and average BLEU score for each category of news article. Among all categories, the model worked best for the business articles, as business articles contain more numerical values than the other categories and numerical values receive high priority in this model.

Fig. 2. Summary score comparison between news article categories

Text summarization accuracy may vary depending on the type of dataset. A similar approach has been applied by R. Khan, Y. Qian, and S. Naeem [12], who used the TF-IDF score instead of Gensim Word2Vec. In our research we have used BBC news articles, whereas they worked on a different dataset; their highest BLEU score was 0.503984. On the other hand, our highest BLEU score was 0.894, for a business article. From the data, it is quite evident that using a different sentence scoring algorithm and Gensim Word2Vec instead of TF-IDF yields a better score.

V. CONCLUSION

Text summarization is one of the most renowned buzzwords in the area of natural language processing research, as textual data is increasing day by day. The proposed model introduces Gensim Word2Vec in combination with the K-Means clustering algorithm and a new sentence scoring procedure, which opens a new dimension of research in text summarization. In this model, all the sentences were clustered using the K-Means clustering algorithm. The sentence scoring algorithm rates a sentence based on the occurrence of numerical values and nouns. These techniques were implemented on BBC news article datasets. The proposed model showed the best performance on the business articles because business articles contain more numerical values and the sentence scoring algorithm gives priority to numerical values. In the future, the same idea can also be implemented for extractive summarization of multiple text documents.

REFERENCES

[1] D. Radev, "Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies," in Proc. ACL/NAACL Workshop on Summarization, Seattle, WA, 2000.
[2] V. Gupta and G. S. Lehal, "A survey of text summarization extractive techniques," Journal of Emerging Technologies in Web Intelligence, vol. 2, no. 3, pp. 258–268, 2010.
[3] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," arXiv preprint arXiv:1705.04304, 2017.
[4] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, "Assessing sentence scoring techniques for extractive text summarization," Expert Systems with Applications, vol. 40, no. 14, pp. 5755–5764, 2013.
[5] J. Kupiec, J. Pedersen, and F. Chen, "A trainable document summarizer," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 68–73.
[6] S. A. Anam, A. M. Rahman, N. N. Saleheen, and H. Arif, "Automatic text summarization using fuzzy c-means clustering," in 2018 Joint 7th International Conference on Informatics, Electronics & Vision (ICIEV) and 2018 2nd International Conference on Imaging, Vision & Pattern Recognition (icIVPR). IEEE, 2018, pp. 180–184.
[7] D. Das and A. Martins, "A survey on automatic text summarization," Literature Survey for the Language and Statistics II Course at CMU, 2007.
[8] B. Li, A. Drozd, Y. Guo, T. Liu, S. Matsuoka, and X. Du, "Scaling word2vec on big corpus," Data Science and Engineering, vol. 4, no. 2, pp. 157–175, 2019.
[9] R. A. García-Hernández, R. Montiel, Y. Ledeneva, E. Rendón, A. Gelbukh, and R. Cruz, "Text summarization by sentence extraction using unsupervised learning," in Mexican International Conference on Artificial Intelligence. Springer, 2008, pp. 133–143.
[10] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," 1973.
[11] D. Greene and P. Cunningham, "Practical solutions to the problem of diagonal dominance in kernel document clustering," in Proc. 23rd International Conference on Machine Learning (ICML'06). ACM Press, 2006, pp. 377–384.
[12] R. Khan, Y. Qian, and S. Naeem, "Extractive based text summarization using k-means and tf-idf," International Journal of Information Engineering and Electronic Business, vol. 11, no. 3, p. 33, 2019.
[13] D. Karani, "Introduction to Word Embedding and Word2Vec," https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa, Sep 1, 2018.
[14] R. Řehůřek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50, http://is.muni.cz/publication/884893/en.
[15] J. P. Ortega, M. Del, R. B. Rojas, and M. J. Somodevilla, "Research issues on k-means algorithm: An experimental trial using Matlab," in CEUR Workshop Proceedings: Semantic Web and New Technologies, 2009, pp. 83–96.
[16] M. M. Haider, "Sentence Scoring Based on Noun and Numerical Values," https://towardsdatascience.com/sentence-scoring-based-on-noun-and-numerical-values-d7ac4dd787f2, Feb 1, 2020.