
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 6, No. 1, 2015

A Survey of Topic Modeling in Text Mining

Rubayyi Alghamdi
Information Systems Security, CIISE, Concordia University
Montreal, Quebec, Canada

Khalid Alfalqi
Information Systems Security, CIISE, Concordia University
Montreal, Quebec, Canada

Abstract—Topic modeling provides a convenient way to analyze large collections of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories that fall under the field of topic modeling. The first discusses the methods of topic modeling, four of which are considered under this category: Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and the Correlated Topic Model (CTM). The second category, the topic evolution model, takes into account an important additional factor, time. In this category, different models are discussed, such as Topics over Time (TOT), Dynamic Topic Models (DTM), multiscale topic tomography, dynamic topic correlation detection, and detecting topic evolution in scientific literature.

Keywords—Topic Modeling; Methods of Topic Modeling; Latent Semantic Analysis (LSA); Probabilistic Latent Semantic Analysis (PLSA); Latent Dirichlet Allocation (LDA); Correlated Topic Model (CTM); Topic Evolution Model

I. INTRODUCTION

Managing today's explosion of electronic document archives requires new techniques or tools that deal with automatically organizing, searching, indexing, and browsing large collections. Building on today's research in machine learning and statistics, new techniques have been developed for finding patterns of words in document collections using hierarchical probabilistic models. These models are called "topic models". The discovered patterns often reflect the underlying topics that combine to form the documents; and since hierarchical probabilistic models are easily generalized to other kinds of data, topic models have also been used to analyze things other than words, such as images, biological data, and survey data [1].

The main importance of topic modeling is to discover patterns of word use and to connect documents that share similar patterns. The idea of topic models is that documents are mixtures of topics, where a topic is a probability distribution over words. In other words, a topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To create a new document, choose a distribution over topics; after that, for each word in that document, choose a topic at random according to this distribution, and then draw the word from that topic [2].
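As a concrete illustration of this generative procedure, the following minimal sketch (Python with NumPy; the toy vocabulary and the two topic distributions are invented for illustration, not taken from the paper) generates a short document exactly as described: pick a topic for each word from the document's topic mixture, then draw the word from that topic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two hand-made topics (distributions over words).
vocab = ["gene", "dna", "cell", "ball", "game", "team"]
topics = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],   # a "biology" topic
    [0.01, 0.02, 0.02, 0.30, 0.30, 0.35],   # a "sports" topic
])

def generate_document(topic_mixture, n_words=10):
    """Generate a document: pick a topic per word, then a word from that topic."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=topic_mixture)   # choose a topic at random
        w = rng.choice(len(vocab), p=topics[z])        # draw a word from that topic
        words.append(vocab[w])
    return words

print(generate_document(topic_mixture=[0.7, 0.3]))
```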
From the text analysis and text mining perspective, topic models rely on the bag-of-words assumption, which ignores the information carried by the ordering of words. According to Seungil and Stephen (2010), "Each document in a given corpus is thus represented by a histogram containing the occurrence of words. The histogram is modeled by a distribution over a certain number of topics, each of which is a distribution over words in the vocabulary. By learning the distributions, a corresponding low-rank representation of the high-dimensional histogram can be obtained for each document" [3]. The various kinds of topic models, such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and the Correlated Topic Model (CTM), have successfully improved classification accuracy in the area of topic discovery [3].

As time passes, the topics in a document corpus evolve, and modeling topics without considering time will confound topic discovery. Modeling topics with time taken into account is called topic evolution modeling. Topic evolution modeling can disclose important hidden information in a document corpus, allowing topics to be identified together with the times at which they appear, and their evolution to be checked over time.

There are many areas that can use topic evolution models. A typical example: a researcher wants to choose a research topic in a certain field, would like to know how this topic has evolved over time, and tries to identify the documents that explain the topic. For this second category, the paper reviews several important topic evolution models.

These two categories give a good high-level view of topic modeling and are a helpful way to better understand its concepts. Within the first category, the four methods on which topic modeling relies are Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and the Correlated Topic Model (CTM). For each of these methods, the paper gives a general overview, its importance, and an example that illustrates the general idea of using it. For the second category, the paper covers the areas that topic evolution modeling provides, such as Topics over Time (TOT), Dynamic Topic Models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, and the web of topics. Furthermore, it presents an overview of each category and provides examples, if any, as well as some limitations and characteristics of each part.


This paper is organized as follows. Section II presents the first category, the methods of topic modeling, with its four methods and their general concepts as subsections. Section III gives an overview of the second category, topic evolution modeling, including its parts. Conclusions follow in Section IV.

II. THE METHODS OF TOPIC MODELING

This section discusses some of the topic modeling methods that deal with words, documents, and topics, along with the general idea of each method and examples where available. These methods are also involved in many applications, so a brief idea of the applications with which each method can work is given as well.

A. Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a technique from the area of Natural Language Processing (NLP). The main goal of LSA is to create vector-based representations of texts in order to capture their semantic content. Using these vector representations, LSA computes the similarity between texts in order to pick the most closely related words. In the past, LSA was named Latent Semantic Indexing (LSI) and was later refined for information retrieval tasks, in which the few documents closest to a query are selected from many documents. LSA combines several ingredients, such as keyword matching, weighted keyword matching, and vector representations that depend on the occurrences of words in documents. LSA uses Singular Value Decomposition (SVD) to rearrange the data.

SVD is a method that factorizes a matrix in order to reconfigure and compute all the dimensions of the vector space; the dimensions are computed and ordered from the most to the least important. In LSA, the most significant dimensions are used to find the meaning of the text, while the least important ones are ignored. Words with a high rate of similarity are found as those whose vectors are similar. The most essential steps in LSA are: first, collect a huge set of relevant text and divide it into documents. Second, build a co-occurrence matrix of terms and documents, in which a cell relates document x and term y, giving m-dimensional vectors for terms and n-dimensional vectors for documents. Third, weight and compute each cell. Finally, SVD plays a big role in computing all the dimensions and produces three matrices.
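As a hedged illustration of these steps, the sketch below (NumPy; the tiny term-document count matrix is invented) builds a term-document matrix, applies SVD, and keeps only the top k dimensions so that documents can be compared in the reduced latent space:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
A = np.array([
    [3, 2, 0, 0],   # "gene"
    [2, 3, 0, 1],   # "cell"
    [0, 1, 4, 3],   # "game"
    [0, 0, 3, 4],   # "team"
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, singular values sorted from most
# to least important dimension of the vector space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the k most significant dimensions (the LSA step).
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in latent space

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Documents 0 and 1 share latent structure; documents 0 and 2 do not.
print(cosine(doc_vectors[0], doc_vectors[1]))
print(cosine(doc_vectors[0], doc_vectors[2]))
```

Discarding the smaller singular values is what lets LSA treat words with similar co-occurrence patterns, such as synonyms, as close vectors.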
B. Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (PLSA) is an approach that was released after LSA to fix some of the disadvantages found in LSA. Jan Puzicha and Thomas Hofmann introduced it in 1999. PLSA is a method for automated document indexing based on a statistical latent class model for factor analysis of count data; it improves on Latent Semantic Analysis (LSA) in a probabilistic sense by using a generative model. The main goal of PLSA is to identify and distinguish between different contexts of word usage without recourse to a dictionary or thesaurus. This has two important implications: first, it allows the disambiguation of polysemy, i.e., words with multiple meanings; second, it discloses typical similarities by grouping together words that share a common context [3].

According to Kakkonen, Myller, Sutinen, and Timonen (2008), "PLSA is based on a statistical model that is referred as an aspect model. An aspect model is a latent variable model for co-occurrence data, which associates unobserved class variables with each observation" [4]. The PLSA method improves on LSA and also resolves other problems that LSA cannot. PLSA has been successful in many real-world applications, including computer vision and recommender systems. However, since the number of parameters grows linearly with the number of documents, PLSA suffers from overfitting problems. Some of its applications are nevertheless discussed later [5].

The probabilistic model introduces a latent variable zk ∈ {z1, z2, ..., zK}, which corresponds to a potential semantic layer. In the full model, p(di) is the probability of document di in the data set; p(wj | zk) is the probability of the term (word) wj under the defined semantics zk; and p(zk | di) represents the distribution of semantics over the document. Using these definitions, the generative model produces new data by the following steps [3]:

1) Select a document di with probability P(di),
2) Pick a latent class zk with probability P(zk | di),
3) Generate a word wj with probability P(wj | zk).

Fig. 1. High-level view of PLSA

PLSA has two different formulations. The first is the symmetric formulation, in which the word w and the document d are both generated from the latent class c in similar ways, using the conditional probabilities P(d | c) and P(w | c). The second is the asymmetric formulation, in which, for each document d, a latent class is chosen conditionally on the document according to P(c | d), and a word is then generated from that class according to P(w | c) [6]. Each of these two formulations has rules and algorithms that can be used for different purposes.


Both formulations have since been improved with the release of Recursive Probabilistic Latent Semantic Analysis (RPLAS), an extension of PLSA that improves both the asymmetric and symmetric formulations.

Fig. 2. A graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization [3]

PLSA has applications in various fields such as information retrieval and filtering, natural language processing, and machine learning from text. Specific applications include automatic essay grading, classification, topic tracking, image retrieval, and automatic question recommendation. Two of these applications are discussed below:

- Image retrieval: The PLSA model represents each image as a collection of visual words from a discrete and finite visual vocabulary. The occurrences of visual words in an image are counted into a co-occurrence vector. The co-occurrence vectors of the images are used to build the co-occurrence table with which the PLSA model is trained. Once the PLSA model is known, it can be applied to all the images in the database; each image is then represented by a vector whose elements denote the degree to which the image depicts a certain topic [7].

- Automatic question recommendation: One significant application of PLSA is question recommendation. In this kind of application, the word is independent of the user when the user wants a specific meaning; once the user obtains the answers and the latent semantics underlying the questions, recommendations can be made based on the similarities of these latent semantics. Wang, Wu and Cheng (2008) report: "Therefore, PLSA could be used to model the users' profile (represented by the questions that the user asks or answers) and the questions as well through estimating the probabilities of the latent topics behind the words. Because the user's profile is represented by all the questions that he/she asks or answers, we only need to consider how to model the question properly" [8].
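The three generative steps of the asymmetric formulation above can be sketched as follows (a minimal NumPy illustration; the probability tables for P(d), P(z | d), and P(w | z) are invented placeholders rather than fitted parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented PLSA tables for 2 documents, 2 latent classes, 4 words.
P_d = np.array([0.5, 0.5])                     # P(d_i)
P_z_given_d = np.array([[0.9, 0.1],            # P(z_k | d_i), one row per document
                        [0.2, 0.8]])
P_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],  # P(w_j | z_k), one row per class
                        [0.1, 0.1, 0.4, 0.4]])

def generate_observation():
    """One (document, word) co-occurrence, following PLSA's three steps."""
    d = rng.choice(2, p=P_d)                   # 1) select a document d_i
    z = rng.choice(2, p=P_z_given_d[d])        # 2) pick a latent class z_k
    w = rng.choice(4, p=P_w_given_z[z])        # 3) generate a word w_j
    return d, w

print([generate_observation() for _ in range(5)])
```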


C. Latent Dirichlet Allocation

The Latent Dirichlet Allocation (LDA) model appeared in order to improve the way mixture models capture the exchangeability of both words and documents, moving beyond the older approaches of PLSA and LSA. The classic representation theorem, dating back to 1990, lays down that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture [9].

The huge number of electronic document collections such as the web, scientifically interesting blogs, news articles, and literature has in the recent past posed several new challenges to researchers in the data mining community. In particular, there is a growing need for automatic techniques to visualize, analyze, and summarize these document collections. In the recent past, latent topic modeling, exemplified by models such as LDA, has become very popular as a completely unsupervised technique for topic discovery in large document collections [10].

Latent Dirichlet Allocation (LDA) is a very widely used algorithm for text mining based on statistical (Bayesian) topic models. LDA is a generative model that tries to mimic the writing process: it tries to generate a document on the given topic. It can also be applied to other types of data. There are tens of LDA-based models, including temporal text mining, author-topic analysis, supervised topic models, latent Dirichlet co-clustering, and LDA-based bioinformatics [11], [18].

Put simply, the basic idea of the process is that each document is modeled as a mixture of topics, and each topic is a discrete probability distribution that defines how likely each word is to appear in that topic. These topic probabilities provide a concise representation of a document. Here, a "document" is a "bag of words" with no structure beyond the topic and word statistics.

Fig. 3. A graphical model representation of LDA

LDA models each of D documents as a mixture over K latent topics, each of which describes a multinomial distribution over a W-word vocabulary. Figure 3 shows the graphical model representation of the LDA model. The generative process for basic LDA is as follows.

For each of the Nj words in document j:
1) Choose a topic zij ∼ Mult(θj),
2) Choose a word xij ∼ Mult(φzij),

where the parameters of the multinomials for topics in a document, θj, and for words in a topic, φk, have Dirichlet priors [12].
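A hedged sketch of this generative process follows (NumPy; the hyperparameters α and β and the corpus sizes are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

K, W, D, N = 3, 8, 4, 20        # topics, vocabulary size, documents, words/doc
alpha, beta = 0.5, 0.1          # symmetric Dirichlet hyperparameters

# Topic-word distributions phi_k ~ Dirichlet(beta), one per topic.
phi = rng.dirichlet(np.full(W, beta), size=K)

corpus = []
for j in range(D):
    theta_j = rng.dirichlet(np.full(K, alpha))   # document's topic mixture
    doc = []
    for _ in range(N):
        z = rng.choice(K, p=theta_j)             # z_ij ~ Mult(theta_j)
        x = rng.choice(W, p=phi[z])              # x_ij ~ Mult(phi_z)
        doc.append(x)
    corpus.append(doc)

print(corpus[0])   # word indices of the first generated document
```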
Indeed, there are several applications and models based on the Latent Dirichlet Allocation (LDA) method, such as:

- Role discovery: Social Network Analysis (SNA) is the study of mathematical models for interactions among people, organizations, and groups. Because of the connections that emerged among the 9/11 hijackers and the huge human data sets on popular web services like facebook.com and MySpace.com, there has been growing interest in social network analysis. This led to the Author-Recipient-Topic (ART) model for Social Network Analysis, which combines Latent Dirichlet Allocation (LDA) and the Author-Topic (AT) model. The idea of ART is to learn topic distributions based on the direction-sensitive messages sent between senders and receivers [13].

- Emotion topic: The Pairwise-Link-LDA model focuses on the problem of jointly modeling text and citations in the topic modeling area. It is built on the ideas of LDA and Mixed Membership Stochastic Block models (MMSB) and allows modeling arbitrary link structure [14].

- Automatic essay grading: The automatic essay grading problem is closely related to automatic text categorization, which has been researched since the 1960s. LDA has been shown to be a reliable method for information retrieval tasks ranging from information filtering and classification to document retrieval and classification [15].

- Anti-phishing: Phishing emails are a way to steal sensitive information such as account information, credit card numbers, and social security numbers. Email filtering or web site filtering is not an effective way to prevent phishing emails. Since latent topics are clusters of words that appear together in an email, one can expect that in a phishing email the words "click" and "account" often appear together. Usual latent topic models do not take into account different classes of documents, e.g., phishing or non-phishing. For that reason, researchers developed a new statistical model, the latent Class-Topic Model (CLTOM), which is an extension of Latent Dirichlet Allocation (LDA) [16].

- Example of LDA: As an illustrative example of the use of an LDA model on real data, consider a subset of the TREC AP corpus containing 16,000 documents. First, the stop-words are removed from the TREC AP corpus before running topic modeling. After that, the EM algorithm is used to find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model. The top words from some of the resulting multinomial distributions are illustrated in Figure 4. These distributions seem to capture some of the underlying topics in the corpus, and the topics are named accordingly [9].

Fig. 4. Most likely words from 4 topics in LDA from the AP corpus; the topic titles in quotes are not part of the algorithm

D. Correlated Topic Model

The Correlated Topic Model (CTM) is a kind of statistical model used in natural language processing and machine learning. CTM is used to discover the topics present in a group of documents. The key to CTM is the logistic normal distribution. CTM builds on LDA.

TABLE I. THE CHARACTERISTICS OF TOPIC MODELING METHODS [17]

Latent Semantic Analysis (LSA):
* LSA can grasp the topic even when synonymous words are used.
* It does not have a robust statistical background.

Probabilistic Latent Semantic Analysis (PLSA):
* It generates each word from a single topic, even though various words in one document may be generated from different topics.
* PLSA handles polysemy.

Latent Dirichlet Allocation (LDA):
* Stop-words need to be removed manually.
* LDA cannot represent relationships among topics.

Correlated Topic Model (CTM):
* Uses the logistic normal distribution to create relations among topics.
* Allows the occurrence of words in other topics and topic graphs.
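The key difference between LDA and CTM, drawing topic proportions from a logistic normal rather than a Dirichlet so that topics can co-vary, can be sketched as follows (NumPy; the mean vector and covariance matrix are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

K = 3                                   # number of topics
mu = np.zeros(K)                        # mean of the logistic normal
Sigma = np.array([[1.0, 0.8, 0.0],      # positive covariance: topics 0 and 1
                  [0.8, 1.0, 0.0],      # tend to rise and fall together
                  [0.0, 0.0, 1.0]])

def sample_topic_proportions():
    """CTM-style draw: a Gaussian sample mapped through the softmax."""
    eta = rng.multivariate_normal(mu, Sigma)
    e = np.exp(eta - eta.max())          # numerically stable softmax
    return e / e.sum()

# Unlike independent Dirichlet draws, topics 0 and 1 are correlated
# across documents, which is exactly what CTM is designed to express.
samples = np.array([sample_topic_proportions() for _ in range(1000)])
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])
```

A Dirichlet cannot encode such off-diagonal structure, which is why LDA cannot represent relationships among topics (Table I) while CTM can.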


TABLE II. THE LIMITATIONS OF TOPIC MODELING METHODS [17]

Latent Semantic Analysis (LSA):
- It is hard to obtain and to determine the number of topics.
- It is hard to interpret the loading values with a probability meaning.

Probabilistic Latent Semantic Analysis (PLSA):
- PLSA provides no probabilistic model at the level of documents.

Latent Dirichlet Allocation (LDA):
- It is unable to model relations among topics, a problem solved by the CTM method.

Correlated Topic Model (CTM):
- Requires a lot of calculation.
- Has many general words inside the topics.

III. METHODS ABOUT TOPIC EVOLUTION MODELS

A. Overview of topic evolution models

As time goes by, the themes of a document corpus evolve, and modeling topics without considering time causes problems. For example, in analyzing the topics of U.S. Presidential State-of-the-Union addresses, LDA confounded the Mexican-American War with some aspects of World War I, since it did not notice that there were 70 years of separation between the two events.

It is important to model topic evolution so that people can identify topics within their context (i.e., time) and see how topics evolve over time. There are many applications where topic evolution models can be applied. For example, by checking topic evolution in scientific literature, one can see the topic lineage and how research on one topic influences another.

This section reviews several important papers that model topic evolution. They use different models, but all of them take the important factor, time, into account. For example, probabilistic time series models are used to handle the issues in the paper "Dynamic topic models", while non-homogeneous Poisson processes and multiscale analysis with Haar wavelets are employed in the paper "Multiscale topic tomography".
B. A Non-Markov Continuous-Time Method

Since most big data sets have dynamic co-occurrence patterns, with word and topic co-occurrence patterns changing over time, the Topics over Time (TOT) model captures topics and their changes over time by taking into account both the word co-occurrence pattern and time [18]. In this method, a topic is considered to be associated with a continuous distribution over time.

In TOT, for each document, a multinomial distribution over topics is sampled from a Dirichlet, words are generated from the multinomial of each topic, and a per-topic Beta distribution generates the document's timestamp. If a pattern of strong word co-occurrence exists for a short time, TOT will create a narrow-time-distribution topic; if such a pattern persists for a while, it will generate a broad-time-distribution topic.

The main point of this work is that it models topic evolution without discretizing time and without making Markov assumptions on the evolution of state from time t to time t + 1. Applied to two centuries of U.S. Presidential State-of-the-Union addresses, TOT discovers topics with time localization and also improves word clarity over LDA. Another experimental result, on 17 years of the NIPS conference, demonstrates clear topical trends.
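A minimal sketch of TOT's distinctive step, drawing a document timestamp from a per-topic Beta distribution over normalized time, is given below (NumPy; the Beta parameters are invented to show one narrow and one broad topic):

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented per-topic Beta(a, b) parameters over normalized time in [0, 1]:
# topic 0 is sharply localized in time, topic 1 spans the whole period.
beta_params = {0: (80.0, 20.0),   # narrow topic, concentrated near t = 0.8
               1: (2.0, 2.0)}     # broad topic, spread across all times

def sample_timestamp(topic):
    """TOT-style draw of a document timestamp from the topic's Beta."""
    a, b = beta_params[topic]
    return rng.beta(a, b)

for k in (0, 1):
    ts = np.array([sample_timestamp(k) for _ in range(1000)])
    print(f"topic {k}: mean time {ts.mean():.2f}, spread {ts.std():.2f}")
```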


C. Dynamic Topic Models (DTM)

The authors of this work developed a statistical model of topic evolution, along with approximate posterior inference techniques for determining the evolving topics from a sequential document collection [19]. The corpus of documents is assumed to be organized into time slices; the documents of each time slice are modeled with a K-component topic model, and the topics associated with time slice t evolve from the topics corresponding to time slice t-1.

Dynamic topic models estimate the topic distribution at different epochs. They use a Gaussian prior for the topic parameters instead of a Dirichlet, and can capture topic evolution over time slices. Using this model, one can infer which words differ from those of previous epochs and predict how topics will change.
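The evolution mechanism can be sketched as a Gaussian random walk on the natural parameters of a topic's word distribution, mapped back to the simplex with a softmax (NumPy; the vocabulary size, number of slices, and drift variance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

W, T = 6, 4            # vocabulary size, number of time slices
sigma = 0.5            # drift standard deviation between slices

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# beta_t | beta_{t-1} ~ Normal(beta_{t-1}, sigma^2 I): the topic's natural
# parameters drift smoothly, so its word distribution evolves slice by slice.
beta = rng.normal(size=W)
for t in range(T):
    word_dist = softmax(beta)
    print(f"slice {t}: top word index {word_dist.argmax()}, "
          f"p = {word_dist.max():.2f}")
    beta = beta + rng.normal(scale=sigma, size=W)
```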
D. Multiscale Topic Tomography

This method assumes that the document collection is sorted in ascending order of time and grouped into equal-sized chunks, each of which represents the documents of one epoch. Each document in an epoch is represented by a word-count vector, and each epoch is associated with its word-generation Poisson parameters, each of which represents the expected word counts of a topic. A non-homogeneous Poisson process is used to model word counts, since it is a natural way to do the task and it is also amenable to sequence modeling through Bayesian multiscale analysis. Multiscale analysis is also applied to the Poisson parameters, which makes it possible to model the temporal evolution of topics at different time scales.

This method is similar to DTM but provides more flexibility by allowing the study of topic evolution at various time scales [20].
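A hedged sketch of the multiscale idea: the per-epoch expected word counts can be aggregated dyadically, in the spirit of a Haar-wavelet decomposition, so that a topic's activity is visible at several time scales at once (NumPy; the epoch counts are invented):

```python
import numpy as np

# Invented expected counts of one word, one value per epoch (8 epochs).
counts = np.array([2.0, 3.0, 10.0, 12.0, 11.0, 9.0, 3.0, 2.0])

# Dyadic aggregation: each coarser scale averages adjacent pairs,
# mirroring a Haar-style multiscale view of the Poisson rates.
scale = counts
while True:
    print(scale)                        # rates at the current time scale
    if len(scale) == 1:
        break
    scale = 0.5 * (scale[0::2] + scale[1::2])
```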
E. A Non-parametric Approach to Dynamic Topic Correlation Detection (DCTM)

This method models topic evolution by discretizing time [21]. The corpus is divided into sets of documents, each set containing documents with the same timestamp. It is assumed that all documents in a corpus share the same time scale and that each document corpus shares the same vocabulary of size d.

Basically, DCTM maps the high-dimensional space (words) to a lower-dimensional space (topics) and models the dynamic topic evolution in a corpus. A hierarchy over the correlation latent space, called the temporal prior, is constructed; the temporal prior is used to capture the dynamics of topics and correlations.

DCTM works as follows. First, the latent topics of each document corpus are discovered, by first summarizing the contribution of documents at a certain time, which is done by aggregating the features of all documents. Then, a Gaussian process latent variable model (GP-LVM) is used to capture the relationship between each pair of document and topic set. Next, a hierarchical Gaussian process latent variable model (HGP-LVM) is employed to model the relationship between each pair of topic sets. Posterior inference of topics and correlations is then used to identify the dynamic changes of topic-related word probabilities and to predict topic evolution and topic correlations.

An important feature of this work is that the model is non-parametric: it can marginalize out the parameters, and it exhibits faster convergence than the generative processes.
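DCTM and the citation-based methods below all discretize time and then relate the topics of adjacent slices. A generic, hedged sketch of that shared step, linking each topic in slice t to its most similar predecessor in slice t-1 by cosine similarity of word distributions (NumPy; the topic-word matrices are invented, and real systems use model-specific machinery rather than this simple matching):

```python
import numpy as np

rng = np.random.default_rng(6)

def random_topics(k, w):
    """Invented topic-word distributions for one time slice (k topics)."""
    t = rng.random((k, w))
    return t / t.sum(axis=1, keepdims=True)

prev_topics = random_topics(3, 10)   # topics discovered in slice t-1
curr_topics = random_topics(4, 10)   # topics discovered in slice t

def link_topics(prev, curr):
    """For each current topic, find the most similar previous topic."""
    prev_n = prev / np.linalg.norm(prev, axis=1, keepdims=True)
    curr_n = curr / np.linalg.norm(curr, axis=1, keepdims=True)
    sims = curr_n @ prev_n.T          # cosine similarity matrix
    return sims.argmax(axis=1), sims.max(axis=1)

parents, strengths = link_topics(prev_topics, curr_topics)
for j, (p, s) in enumerate(zip(parents, strengths)):
    print(f"topic {j} in slice t evolves from topic {p} (similarity {s:.2f})")
```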
F. Detecting Topic Evolution of Scientific Literature

This method builds on the observation that citations indicate important relationships between topics, and it uses citations to model the topic evolution of scientific literature [22]. Not only the papers in a corpus D(t) but also the cited papers are considered for topic detection. A Bayesian model is used to identify topic evolution.

In this method, "a document consists of a vocabulary distribution, a citation and a timestamp". The document corpus is divided into a set of subsets based on the timestamps; for time unit t, the corresponding documents are denoted D(t). For each time unit, topics are generated independently. The topic evolution analysis in this work specifically analyzes the relationship between the topics in D(t) and those in D(t-1); in other words, it models topic evolution by discretizing time.

The authors first proposed two citation-unaware topic evolution learning methods: the independent and the accumulative topic evolution learning methods. In the independent method, the topics in D(t) are independent of those in D(t-1), while in the accumulative method the topics in D(t) depend on those in D(t-1). Citation is then integrated into both approaches through an iterative learning process based on Dirichlet prior smoothing, which takes into account the fact that different citations have different importance for topic evolution. Finally, an inheritance topic model is proposed to capture how citations can be employed to analyze topic evolution.

G. Discovering the Topology of Topics

A topic is semantically coherent content that is shared by a document corpus. As time passes, some documents in a topic may initiate content that differs markedly from the original content. If the initiated content is shared by many later documents, it is identified as a new topic. This work [23] aims to discover this evolutionary process of topics; in it, a topic is defined as "a quantized unit of evolutionary change in content".

The method develops an iterative topic evolution learning framework by integrating Latent Dirichlet Allocation into a citation network. It also develops an inheritance topic model using citation counts.

It works as follows. First, it tries to identify a new topic by detecting significant content changes in a document corpus. If the new content is different from the original content and is shared by later documents, it is identified as a new topic.

The next step is to explore the relationship between the new topics and the original topics. This works by finding the member documents of each topic and examining the relationships. Citation relationships are also used to find the member documents of each topic: if a paper cites a start paper, it is considered a member paper of that start paper; in addition, papers that are textually close to the start paper are also considered its member papers. The relationship between the original topics and the newly discovered topics is identified by citation counts. The experimental results demonstrate that citations help to better understand topic evolution.

H. Summary of topic evolution models

The main characteristics of the topic evolution models discussed in Section III are summarized as follows:

TABLE III. THE MAIN CHARACTERISTICS OF TOPIC EVOLUTION MODELS

Modeling topic evolution by a continuous-time model:
1) "Topics over time: a non-Markov continuous-time model of topical trends"

Modeling topic evolution by discretizing time:
1) "Dynamic topic models"
2) "Multiscale topic tomography"
3) "A Non-parametric Approach to Pair-wise Dynamic Topic Correlation Detection"

Modeling topic evolution by using citation relationships as well as discretizing time:
1) "Detecting topic evolution in scientific literature: How can citations help"
2) "The Web of Topics: Discovering the Topology of Topic Evolution in a Corpus"

I. Comparison of the Two Categories

The main difference between the two categories is as follows: in the first category, topics are modeled without time and essentially model words, while in the second category topics are modeled with time taken into account, viz. continuous time, discretized time, or a combination of time discretization and citation relationships.

Due to the different characteristics of these two categories, the methods in the second category are more accurate in terms of topic discovery.

IV. CONCLUSION


This survey paper presented two categories that fall under the term of topic modeling in text mining. In the first category, it discussed the general ideas of four topic modeling methods: Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and the Correlated Topic Model (CTM). In addition, it explained the differences between these four methods in terms of characteristics, limitations, and theoretical backgrounds. The paper does not go into the specific details of each method; it only describes the high-level view of these topics as they relate to topic modeling in text mining. Furthermore, it mentioned some of the applications in which these four methods are involved, and it noted that each of these four methods improves on and modifies the previous one. Modeling topics without taking 'time' into account will confound topic discovery. In the second category, the paper discussed topic evolution models, which consider time. The papers reviewed use different methods to model topic evolution: some use discretized time, a continuous-time model, or citation relationships together with time discretization. All of them consider the important factor, time.

REFERENCES
[1] Blei, D.M., and Lafferty, J.D., "Dynamic Topic Models", Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.
[2] Steyvers, M., and Griffiths, T., "Probabilistic topic models", in T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning, Lawrence Erlbaum, 2007.
[3] Hofmann, T., "Unsupervised learning by probabilistic latent semantic analysis", Machine Learning, 42(1), 2001, 177-196.
[4] Kakkonen, T., Myller, N., Sutinen, E., and Timonen, J., "Comparison of Dimension Reduction Methods for Automated Essay Grading", Educational Technology & Society, 11(3), 2008, 275-288.
[5] Liu, S., Xia, C., and Jiang, X., "Efficient Probabilistic Latent Semantic Analysis with Sparsity Control", IEEE International Conference on Data Mining, 2010, 905-910.
[6] Bassiou, N., and Kotropoulos, C., "RPLSA: A novel updating scheme for Probabilistic Latent Semantic Analysis", Department of Informatics, Aristotle University of Thessaloniki, Greece, 2010.
[7] Romberg, S., Hörster, E., and Lienhart, R., "Multimodal pLSA on visual features and tags", IEEE, 2009, 414-417.
[8] Wu, H., Wang, Y., and Cheng, X., "Incremental probabilistic latent semantic analysis for automatic question recommendation", ACM, New York, NY, USA, 2008, 99-106.
[9] Blei, D.M., Ng, A.Y., and Jordan, M.I., "Latent Dirichlet Allocation", Journal of Machine Learning Research, 3, 2003, 993-1022.
[10] Ahmed, A., Xing, E.P., and Cohen, W.W., "Joint Latent Topic Models for Text and Citations", ACM, New York, NY, USA, 2008.
[11] Shen, Z.-Y., Sun, J., and Shen, Y.-D., "Collective Latent Dirichlet Allocation", Eighth IEEE International Conference on Data Mining, 2008, 1019-1025.
[12] Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., and Welling, M., "Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation", ACM, New York, NY, USA, 2008.
[13] McCallum, A., Wang, X., and Corrada-Emmanuel, A., "Topic and role discovery in social networks with experiments on Enron and academic email", Journal of Artificial Intelligence Research, 30(1), 2007, 249-272.
[14] Bao, S., Xu, S., Zhang, L., Yan, R., Su, Z., Han, D., and Yu, Y., "Joint Emotion-Topic Modeling for Social Affective Text Mining", Ninth IEEE International Conference on Data Mining (ICDM '09), 2009, 699-704.
[15] Kakkonen, T., Myller, N., and Sutinen, E., "Applying latent Dirichlet allocation to automatic essay grading", Lecture Notes in Computer Science, 4139, 2006, 110-120.
[16] Bergholz, A., Chang, J., Paaß, G., Reichartz, F., and Strobel, S., "Improved phishing detection using model-based features", 2008.
[17] Lee, S., Baker, J., Song, J., and Wetherbe, J.C., "An Empirical Comparison of Four Text Mining Methods", Proceedings of the 43rd Hawaii International Conference on System Sciences, 2010.
[18] Wang, X., and McCallum, A., "Topics over time: a non-Markov continuous-time model of topical trends", International Conference on Knowledge Discovery and Data Mining, 2006, 424-433.
[19] Blei, D.M., and Lafferty, J.D., "Dynamic topic models", International Conference on Machine Learning, 2006, 113-120.
[20] Nallapati, R.M., Ditmore, S., Lafferty, J.D., and Ung, K., "Multiscale topic tomography", Proceedings of KDD '07, 2007, 520-529.
[21] "A Non-parametric Approach to Pair-wise Dynamic Topic Correlation Detection", Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM 2008), Pisa, Italy, December 2008.
[22] He, Q., Chen, B., Pei, J., Qiu, B., Mitra, P., and Giles, C.L., "Detecting topic evolution in scientific literature: How can citations help?", CIKM, 2009.
[23] Jo, Y., Hopcroft, J.E., and Lagoze, C., "The Web of Topics: Discovering the Topology of Topic Evolution in a Corpus", The 20th International World Wide Web Conference, 2011.
