
2020 IEEE Region 10 Symposium (TENSYMP), 5-7 June 2020, Dhaka, Bangladesh

Automatic Text Summarization Using Gensim Word2Vec and K-Means Clustering Algorithm

Mofiz Mojib Haider, Md. Arman Hossin, Hasibur Rashid Mahi, Hossain Arif
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
[email protected], [email protected], [email protected], [email protected]

Abstract—The significance of text summarization in the Natural Language Processing (NLP) community has expanded because of the staggering increase in virtual textual material. A text summary is produced from one or multiple texts and conveys the important insights of the main text in a shorter form. Text summarization techniques help pick out the indispensable points of the original texts, reducing the time and effort required to read the whole document. The problem has been approached from different points of view, in different domains, and using different concepts. Extractive and abstractive summarization are the two main methods of summing up text; extractive summarization is primarily concerned with deciding what content (words, phrases, and sentences from the original document, weighted by frequency) should be used in the summary. This research proposes a sentence-based clustering algorithm (K-Means) for a single document. For feature extraction, we have used Gensim Word2Vec, which is intended to extract semantic topics from documents automatically and as efficiently as possible.

Index Terms—Text summarization, Extractive, Single Document, NLP, Gensim, Word2Vec, K-Means.

I. INTRODUCTION

The subfield of text summarization has grown over the past half-century. D. R. Radev [1] describes a summary as a text generated from one or more texts that conveys the vital information of the actual document while being no more than half its length, and generally much less. Redundant text exists in the documents generated every day, and the size of these documents is enlarging bit by bit.

It is convenient for people to summarize a document and bring out its implied meaning, but a machine cannot solve the same problem as efficiently as expected; that is why various methods of summarization have been tested to bring out the best possible outcome. However, a universal strategy for summarization is not available yet.

A summarized document reflects the important aspects of the larger text. Different text summarization techniques have been applied over time. An extractive approach to summarizing picks relevant sentences, paragraphs, etc. from the actual document and concatenates them into a simplified form; the importance of sentences is determined based on their statistical and linguistic characteristics [2]. Abstractive text summarization systems, on the other hand, create new sentences, possibly rephrasing with terms that are not in the original document [3].

The purpose of this research is to summarize a single document through a sentence-based model that uses clusters, with Gensim Word2Vec providing the feature extraction that lets the model figure out the main ideas across all the sentences in the text.

The remainder of the report is organized as follows. Part II reviews the literature of the field, Part III describes the proposed model together with the K-Means clustering model and Gensim Word2Vec, Part IV presents the result analysis and evaluation, and finally Part V presents the conclusion and future ideas.

II. LITERATURE REVIEW

Rafael Ferreira et al. [4] addressed extractive text summarization using various sentence scoring methods. Their model is based on tokenizing the words and scoring them to identify the important parts of the document. They tested the algorithm on three types of datasets: CNN, Blog Summarization, and SUMMAC.

Kupiec et al. [5] constructed a trainable summarization program based on statistical classification. They built a classification method that calculates a given sentence's likelihood of inclusion using Bayes' rule, with frequency keywords, title keywords, and sentence location as heuristic features.

Anam et al. [6] suggested a sentence-based model using the Fuzzy C-Means (FCM) clustering algorithm. FCM uses fuzzy sets and a fuzzy partition matrix to govern the relations among the elements of the various clusters.

Das and Martins [7] surveyed single-document and multi-document summarization using both extractive and abstractive approaches, covering various algorithms such as the Naive Bayes method, rich features with decision trees, hidden Markov models, and log-linear models, and showing that performance depends on the dataset.

Romain Paulus et al. [3] proposed a deeply reinforced model for abstractive text summarization. A neural network model with a novel intra-attention mechanism is applied to the input, and standard word prediction is blended with reinforcement learning during training on global sequence prediction to make the generated description more legible.

Bofang Li et al. [8] heuristically developed a Word2Vec variant to ensure that each pair of terms comprises a non-based word and a universally sampled descriptive term; they "freeze" the batch context and adjust only the insignificant part to resolve conflicts.

Rene Arnulfo Garcia-Hernandez et al. [9] suggested an automated text summarization solution based on sentence extraction with an unsupervised learning algorithm, independent of language and domain. Their theory is that an unsupervised algorithm will help bring related ideas (sentences) together.
III. PROPOSED MODEL

The proposed model of this paper is mainly a sentence-based clustering approach to summarizing a news article; it has been demonstrated that sentence-based models are more efficient than graph-based and word-based models [10]. At the very first stage, a news article is selected from the dataset and undergoes a number of preprocessing procedures. During preprocessing, the model performs sentence tokenization, special-character removal, word tokenization, duplicate-word removal, and finally lemmatization to obtain the root words.

After preprocessing, the model performs feature extraction in order to score each sentence of the text. The model uses Gensim Word2Vec to generate a vector representation of the text. The model then distributes the vectorized sentences into k clusters with the K-Means clustering algorithm, where the number of clusters is k; to determine the best value of k, this model uses the Elbow method.

Finally, the model generates a summary by picking important sentences from those clusters based on the score given to each sentence by our scoring algorithm. The generated summary is one-third the length of the given text.
finally lemmatization to get the root word. for the small amount data and for the words those are not that
After completion of the preprocessing, the model will much common. As this process gives input the word, it gets
perform the feature extraction process to score each sentence output of probability distributions for each vector length for
of the text. The model used Gensim Word2Vec to generate every single word where the backpropagation method is used
a vector representation of the text. Then, the model has to deal with it.
distributed the vectorized sentence into k clusters based on the
clustering algorithm K-Means where the number of clusters is
C. Word2Vec

Word2Vec is one of the most common neural-network word embedding techniques. First, each word is given a vector representation of a fixed length, consisting of zeros except for the element representing that word; after training, words with similar meanings take closer spatial positions [13]. The similarity of two word vectors X and Y is measured as the cosine of the angle between them:

    sim(X, Y) = cos(θ) = (X · Y) / (||X|| ||Y||)    (1)

Word2Vec can be implemented in two ways: Skip-Gram and CBOW (Continuous Bag of Words). In our research the Skip-Gram technique is used, as it works better for small amounts of data and for words that are not very common. Given a word as input, the model outputs a probability distribution over the vocabulary for each context position, and backpropagation is used to train it.
some important sentences from those clusters based on the ments Word2Vec based on Latent Dirichlet Allocation (LDA).
score of each sentence given by our processing algorithm. The Gensim Word2Vec has been used to generate a word vector
generated summary will be one-third of the given text. from the tokenized word list.
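A sketch of training Gensim Word2Vec on the preprocessed sentences and deriving per-sentence vectors for clustering. The paper states that Skip-Gram is used (sg=1) but does not publish its hyperparameters or how word vectors are combined into sentence vectors; the vector size, window, min_count, and the averaging step below are assumptions:

    from gensim.models import Word2Vec
    import numpy as np

    # `processed` is the per-sentence token list from the preprocessing step.
    # Gensim 4.x API; sg=1 selects Skip-Gram, as in this model.
    model = Word2Vec(sentences=processed, vector_size=100, window=5,
                     min_count=1, sg=1)

    def sentence_vector(tokens, model):
        """Average the Word2Vec vectors of a sentence's tokens."""
        vectors = [model.wv[w] for w in tokens if w in model.wv]
        if not vectors:
            return np.zeros(model.vector_size)
        return np.mean(vectors, axis=0)

    sentence_vectors = [sentence_vector(s, model) for s in processed]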

E. K-Means Clustering

A cluster is a set of aggregated data points having certain similarities. K-Means is an iterative algorithm that divides the dataset into distinct clusters, keeping each data point in exactly one group. The aim of K-Means is to minimize the sum of squared distances between the data points and their respective cluster centers.

Algorithmic representation of K-Means [15]: let M = {m1, m2, m3, ..., mn} be the collection of data points and V = {v1, v2, ..., vc} be the cluster centers. The steps are as follows (a code sketch follows the list).

1) Select the c cluster centers at random.
2) Compute the distance between each data point and each cluster center.
3) Assign each data point to the cluster center with the minimum distance among all cluster centers.
4) Recalculate each cluster center as

    v_i = (1 / c_i) * Σ_{j=1}^{c_i} m_j    (2)

   where c_i is the number of data points in the i-th cluster and the sum runs over the points assigned to that cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3.
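The paper does not say which K-Means implementation is used; a minimal sketch with scikit-learn, applied to the sentence vectors from the previous step:

    import numpy as np
    from sklearn.cluster import KMeans

    # `sentence_vectors` comes from the Gensim Word2Vec step above.
    X = np.vstack(sentence_vectors)

    k = 3  # illustrative; the actual k is chosen with the Elbow method below
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    labels = kmeans.labels_            # cluster index of each sentence
    centers = kmeans.cluster_centers_  # one center per cluster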
F. Elbow Method to Find k

It is essential to determine the right value of k to get the best outcome from the K-Means algorithm. The Elbow method is one of the most popular methods for determining k, the number of clusters to be used in this model. In this model, k is iterated over the range 1 to 9.

The steps of the Elbow method [12] are as follows (a code sketch follows the list):
• k starts at 1 and runs to 9.
• Increase k by 1 at each step.
• Measure the distortion for each k.
• The elbow is the point after which the distortion begins to decrease in a roughly linear fashion.

Fig. 1. The Elbow with k=2 in Business Doc 465
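The paper selects the elbow by inspecting the distortion plot (Fig. 1); an automated selection is not described there, so the second-difference heuristic in this sketch is an assumed choice:

    import numpy as np
    from sklearn.cluster import KMeans

    def elbow_k(X, k_range=range(1, 10)):
        """Fit K-Means for k = 1..9 and pick the k where the drop in
        distortion (inertia) flattens out the most."""
        inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
                    .fit(X).inertia_ for k in k_range]
        # The largest second difference marks the sharpest bend (the elbow).
        second_diff = np.diff(inertias, 2)
        return k_range[int(np.argmax(second_diff)) + 1], inertias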
G. Summary Extraction

Finally, the cluster containing the maximum number of sentences is selected, since a higher frequency within a cluster indicates the most valuable sentences of the text. All sentences belonging to that cluster are scored based on their mean similarity under the Word2Vec model and on the appearance of numbers and nouns: for each number and each noun, 1 and 0.25, respectively, are added to the mean similarity of the sentence [16]. From that cluster of sentences, the model picks n sentences, where n is one-third of the total number of sentences. The summary is generated by joining these n sentences, sorted by their order of appearance in the text.
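A sketch of this scoring rule. The paper does not state how nouns and numbers are detected; NLTK part-of-speech tags are an assumed choice here:

    import nltk

    # Requires: nltk.download('averaged_perceptron_tagger')

    def score_sentence(tokens, mean_similarity):
        """Add 1 per number and 0.25 per noun to the sentence's mean
        Word2Vec similarity, following the rule described above [16]."""
        score = mean_similarity
        for word, tag in nltk.pos_tag(tokens):
            if tag == 'CD':              # cardinal number
                score += 1.0
            elif tag.startswith('NN'):   # noun tags: NN, NNS, NNP, NNPS
                score += 0.25
        return score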

IV. RESULT AND EVALUATION

There are multiple ways to compare two texts. One of them is the BLEU score, which has been used in this model [12]; BLEU was chosen because it is very easy to implement. It gives a result between 0 and 1, where 1 indicates the highest similarity and 0 the lowest. The generated summary and the original summary of each article have been compared using BLEU, and the maximum score over ten iterations was taken as the BLEU score. Tables I, II, III, IV, and V present the results for a few business, entertainment, politics, sports, and tech article summaries.
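A sketch of the comparison using NLTK's BLEU implementation. The paper does not state whether the per-n-gram columns in the tables are individual or cumulative-up-to-n scores, nor whether smoothing was applied; this sketch computes cumulative scores with a standard smoothing function as assumptions:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu_scores(reference_summary, generated_summary):
        """Cumulative BLEU for 1- to 4-grams between two summaries."""
        ref = [reference_summary.split()]
        hyp = generated_summary.split()
        smooth = SmoothingFunction().method1
        weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                   (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
        return [sentence_bleu(ref, hyp, weights=w, smoothing_function=smooth)
                for w in weights]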
TABLE I
BLEU SCORE OF BUSINESS ARTICLES

Doc Number | k (Elbow) | BLEU 1-gram | BLEU 2-gram | BLEU 3-gram | BLEU 4-gram | Cumulative BLEU
79         | 5         | 0.573       | 0.530       | 0.512       | 0.495       | 0.528
101        | 3         | 0.741       | 0.691       | 0.670       | 0.652       | 0.689
133        | 3         | 0.752       | 0.733       | 0.721       | 0.707       | 0.728
465        | 2         | 0.894       | 0.894       | 0.894       | 0.894       | 0.894
499        | 4         | 0.648       | 0.593       | 0.570       | 0.548       | 0.590

TABLE II
BLEU SCORE OF ENTERTAINMENT ARTICLES

Doc Number | k (Elbow) | BLEU 1-gram | BLEU 2-gram | BLEU 3-gram | BLEU 4-gram | Cumulative BLEU
112        | 3         | 0.669       | 0.660       | 0.658       | 0.653       | 0.660
205        | 3         | 0.836       | 0.810       | 0.794       | 0.777       | 0.804
255        | 2         | 0.548       | 0.496       | 0.481       | 0.465       | 0.497
263        | 3         | 0.622       | 0.564       | 0.540       | 0.521       | 0.562
338        | 4         | 0.819       | 0.791       | 0.771       | 0.749       | 0.783

TABLE III
BLEU SCORE OF POLITICS ARTICLES

Doc Number | k (Elbow) | BLEU 1-gram | BLEU 2-gram | BLEU 3-gram | BLEU 4-gram | Cumulative BLEU
57         | 3         | 0.726       | 0.698       | 0.684       | 0.669       | 0.694
172        | 3         | 0.770       | 0.749       | 0.739       | 0.724       | 0.746
246        | 4         | 0.693       | 0.653       | 0.633       | 0.614       | 0.649
318        | 2         | 0.566       | 0.536       | 0.522       | 0.508       | 0.533
360        | 4         | 0.527       | 0.465       | 0.439       | 0.420       | 0.463

TABLE IV
BLEU SCORE OF SPORTS ARTICLES

Doc Number | k (Elbow) | BLEU 1-gram | BLEU 2-gram | BLEU 3-gram | BLEU 4-gram | Cumulative BLEU
1          | 4         | 0.667       | 0.618       | 0.598       | 0.577       | 0.615
211        | 2         | 0.669       | 0.621       | 0.600       | 0.578       | 0.617
256        | 3         | 0.741       | 0.728       | 0.716       | 0.700       | 0.721
352        | 2         | 0.719       | 0.690       | 0.677       | 0.666       | 0.688
378        | 2         | 0.567       | 0.523       | 0.504       | 0.486       | 0.520

TABLE V
BLEU SCORE OF TECH ARTICLES

Doc Number | k (Elbow) | BLEU 1-gram | BLEU 2-gram | BLEU 3-gram | BLEU 4-gram | Cumulative BLEU
74         | 3         | 0.755       | 0.705       | 0.676       | 0.650       | 0.696
91         | 6         | 0.481       | 0.443       | 0.425       | 0.406       | 0.439
152        | 3         | 0.744       | 0.711       | 0.693       | 0.675       | 0.705
226        | 3         | 0.670       | 0.596       | 0.560       | 0.532       | 0.589
297        | 3         | 0.557       | 0.481       | 0.451       | 0.427       | 0.479
Fig. 2 shows the maximum, minimum, and average BLEU score for each category of news article. Among all categories, the model worked best for the business articles, as business articles contain more numerical values than the other categories and numerical values receive high priority in this model.

Fig. 2. Summary score comparison between news article categories
Text summarization accuracy may vary depending on the type of dataset. A similar approach was applied by R. Khan, Y. Qian, and S. Naeem [12], who used the TF-IDF score instead of Gensim Word2Vec. In our research we used BBC news articles, whereas they worked on a different dataset; their highest BLEU score was 0.503984, while our highest BLEU score was 0.894, for a business article. From the data, it appears that using a different sentence scoring algorithm together with Gensim Word2Vec instead of TF-IDF yields a better score.
V. CONCLUSION

Text summarization is one of the most prominent research topics in natural language processing, as the amount of textual data is increasing day by day. The proposed model introduces Gensim Word2Vec in combination with the K-Means clustering algorithm and a new sentence scoring procedure, which opens a new dimension of research in text summarization. In this model, all the sentences were clustered using the K-Means clustering algorithm, and the sentence scoring algorithm rates a sentence based on the occurrence of numerical values and nouns. These techniques were implemented on the BBC news article dataset. The proposed model showed the best performance on the business articles, because business articles contain more numerical values and the sentence scoring algorithm gives priority to numerical values. In the future, the same idea can also be applied to extractive summarization of multiple text documents.

REFERENCES

[1] D. Radev, "Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies," in Proc. ANLP/NAACL Workshop on Summarization, Seattle, WA, 2000.
[2] V. Gupta and G. S. Lehal, "A survey of text summarization extractive techniques," Journal of Emerging Technologies in Web Intelligence, vol. 2, no. 3, pp. 258–268, 2010.
[3] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," arXiv preprint arXiv:1705.04304, 2017.
[4] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, "Assessing sentence scoring techniques for extractive text summarization," Expert Systems with Applications, vol. 40, no. 14, pp. 5755–5764, 2013.
[5] J. Kupiec, J. Pedersen, and F. Chen, "A trainable document summarizer," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 68–73.
[6] S. A. Anam, A. M. Rahman, N. N. Saleheen, and H. Arif, "Automatic text summarization using fuzzy c-means clustering," in 2018 Joint 7th International Conference on Informatics, Electronics & Vision (ICIEV) and 2018 2nd International Conference on Imaging, Vision & Pattern Recognition (icIVPR). IEEE, 2018, pp. 180–184.
[7] D. Das and A. Martins, "A survey on automatic text summarization," Literature Survey for the Language and Statistics II Course at CMU, 2007.
[8] B. Li, A. Drozd, Y. Guo, T. Liu, S. Matsuoka, and X. Du, "Scaling word2vec on big corpus," Data Science and Engineering, vol. 4, no. 2, pp. 157–175, 2019.
[9] R. A. García-Hernández, R. Montiel, Y. Ledeneva, E. Rendón, A. Gelbukh, and R. Cruz, "Text summarization by sentence extraction using unsupervised learning," in Mexican International Conference on Artificial Intelligence. Springer, 2008, pp. 133–143.
[10] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," 1973.
[11] D. Greene and P. Cunningham, "Practical solutions to the problem of diagonal dominance in kernel document clustering," in Proc. 23rd International Conference on Machine Learning (ICML'06). ACM Press, 2006, pp. 377–384.
[12] R. Khan, Y. Qian, and S. Naeem, "Extractive based text summarization using k-means and tf-idf," International Journal of Information Engineering and Electronic Business, vol. 11, no. 3, p. 33, 2019.
[13] D. Karani, "Introduction to Word Embedding and Word2Vec," https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa, Sep 1, 2018.
[14] R. Řehůřek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50, http://is.muni.cz/publication/884893/en.
[15] J. P. Ortega, M. Del, R. B. Rojas, and M. J. Somodevilla, "Research issues on k-means algorithm: An experimental trial using matlab," in CEUR Workshop Proceedings: Semantic Web and New Technologies, 2009, pp. 83–96.
[16] M. M. Haider, "Sentence Scoring Based on Noun and Numerical Values," https://towardsdatascience.com/sentence-scoring-based-on-noun-and-numerical-values-d7ac4dd787f2, Feb 1, 2020.
