A Gentle Introduction to Topic Modeling Using Python
Abstract
Topic modeling is a data mining method used to understand and categorize large corpora of data. As such, it is a tool
that theological librarians can use in their professional workflows and scholarly practices. In this article, I provide a gentle
introduction to topic modeling using the Python programming language for those who have no prior knowledge of the
topic. I begin with a conceptual overview of topic modeling which does not rely on the complicated mathematics behind the
process. Then, I illustrate topic modeling by providing a step-by-step example of building a topic model using Theological
Librarianship as an example corpus. This example ends with an analysis of the success of the model and suggestions for
improvement. Finally, I comment on the practical application of topic modeling for library workflows.
Introduction
As information professionals, theological librarians benefit from an awareness of tools that have been developed to help sort
through and understand the increasing deluge of information we encounter every day. A topic model is one such tool that,
when employed correctly, can allow its user to understand large corpora of documents. These documents can be anything:
scholarly articles, historical resources, or tweets. The only requirement is that they are available in a digital format.
There is an unofficial tradition in introductory literature about programming or technical computer topics: the tradition of
providing a “gentle introduction.” The idea behind a gentle introduction is to invite the reader into a topic without getting
overly technical. It is in that spirit that I offer the following gentle introduction to topic modeling. The introduction will
be gentle because I will not discuss the mathematics behind topic modeling nor will I provide the Python code I wrote as a
part of this project. For those interested in the code, I have made it available on Github.1
Conceptual Overview
In general, a model is a representation of an object that captures what is important about that object in a way that is more
easily used. For example, an architectural model scales down a building so that it can be more readily observed without
sacrificing the proportions and detail of the original. A topic model is similar; it models a corpus of documents, not by
scaling them down, but by generating topics that are representative of the content of the document corpus. To be more
specific, a topic model is a probabilistic model used to discover topics, or latent structures, across a collection of documents.2
These topics can then be used to organize the documents or develop controlled vocabularies that describe the documents.
While the algorithms that lie behind a topic model are complex, a conceptual explanation of how a topic model operates
is possible.3 An underlying assumption is that any given document contains within it latent topics.4 These topics are
latent because they are not explicitly signaled by the author. Topics, in turn, are composed of words with similar semantic
1. See https://github.com/msaxton/tl_topic_model.
2. Rubayyi Alghamdi and Khalid Alfalqi, “A Survey of Topic Modeling in Text Mining,” International Journal of Advanced Computer Science and Applications 6, no. 1 (2015): 147–153, https://doi.org/10.14569/IJACSA.2015.060121.
3. For the math behind the algorithm used here see David M. Blei et al., “Latent Dirichlet Allocation,” Journal of Machine Learning Research 3, no. 4/5 (May 15, 2003): 993–1022.
4. Megan R. Brett, “Topic Modeling: A Basic Introduction,” Journal of Digital Humanities, April 8, 2013, http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/.
Micah D. Saxton is Collections Development and Reference Librarian at the Ira J. Taylor Library at Iliff School of Theology
in Denver.
theolib.org
PEER REVIEW: A GENTLE INTRODUCTION
domains. To put all of this another way, words can be grouped together to form topics, and topics are grouped
together to form documents. The probability of a topic being present in a document can be measured by the words
used in the document.
Take the front page of a newspaper as an example: a newspaper contains a number of articles (documents) which
can be taken together as a corpus. There are a number of words that may appear in several different articles such as
“Middle East,” “law,” “border,” “immigration officer,” “senate,” “visa,” “illegal,” “policy,” “undocumented,” “asylum,”
“court,” “president,” “legislation,” and “international.” A topic model algorithm would iterate through these articles
and record patterns of word co-occurrence. On the basis of such patterns it may sort “Middle East,” “border,”
“immigration officer,” “visa,” “illegal,” “undocumented,” and “asylum” into one group and sort “policy,” “senate,”
“law,” “court,” “president,” “legislation,” and “international” into another. Each of these groupings is considered
a topic, and we could label the first “immigration” and the second “politics.”
Of course, some words could be placed in both groupings. The topic model algorithm would also assign to each term
a probability of belonging to the group to which it has been assigned. Additionally, the topic model algorithm
would assign a probability of each grouping being found in a given document. It is important to remember that this
process occurs on the basis of any number of algorithms and not on the basis of human conceptual categories, even
though the latter may be used to interpret the former. Insofar as a topic model does not sort documents on the basis
of pre-defined categories, it is an instance of “unsupervised machine learning.”5
Examples of Application
There are a number of useful applications of topic modeling for librarians and scholars. One such application is to
use a topic model to assist in the classification of documents. Manually classifying documents poses a number
of challenges. There is the practical challenge of how long it takes for human catalogers to classify documents. For
example, the efforts of librarians at the National Library of Medicine who are responsible for assigning Medical Subject
Headings (MeSH) to articles published in biomedical fields take an extraordinary amount of time and are expensive.6
MeSH contains over 26,000 terms, and indexers look through an entire article in order to assign those subject
headings. According to Ramakanth Kavuluru and Yuan Lu, this process could be streamlined by supplementing
some of the process with unsupervised machine learning techniques such as a topic model.7 These authors argue that
topic models could be constructed on the corpus of already indexed biomedical articles which could then be used
to classify incoming articles.8 This process still requires humans to map topics to controlled vocabularies, but such
efforts would go a long way toward streamlining the process.
The challenges to manual classification of documents are not just practical; there are theoretical challenges as well.
Scientists are often in the business of producing new knowledge, but that new knowledge may not always relate
neatly to current subject headings. Arho Suominen and Hannes Toivanen argue specifically, “The central novelty
of unsupervised-learning methods in classifying scientific knowledge is that they virtually eliminate the need to
fit new-to-the-world knowledge into known-to-the-world definitions.”9 By “unsupervised-learning methods” these
5. Graham Upton and Ian Cook, “Machine Learning,” in A Dictionary of Statistics, 3rd ed. (Oxford University Press, 2014), http://www.oxfordreference.com.du.idm.oclc.org/view/10.1093/acref/9780199679188.001.0001/acref-9780199679188-e-2380.
6. Ramakanth Kavuluru and Yuan Lu, “Leveraging Output Term Co-Occurrence Frequencies and Latent Associations in Predicting Medical Subject Headings,” Data & Knowledge Engineering, Special issue following the 18th International Conference on Applications of Natural Language Processing to Information Systems (NLDB’13), 94 (November 1, 2014): 189–201, https://doi.org/10.1016/j.datak.2014.09.002.
7. Kavuluru and Lu, “Leveraging Output.”
8. Kavuluru and Lu, “Leveraging Output.”
9. Arho Suominen and Hannes Toivanen, “Map of Science with Topic Modeling: Comparison of Unsupervised Learning and Human-Assigned Subject Classification,” Journal of the Association for Information Science & Technology 67, no. 10 (October 2016): 2465, https://doi.org/10.1002/asi.23596.
APRIL 2018: VOLUME 11, NUMBER 1 • THEOLOGICAL LIBRARIANSHIP
authors mean topic modeling. They go on to argue for the use of topic modeling as a way of classifying documents by
comparing topic modeling with human-assigned subject classification.10 These authors do not suggest that humans
turn over the entirety of document classification to machines, but they do suggest that topic modeling has certain
advantages when it comes to classifying new knowledge.
These are not the only examples of information professionals using topic modeling to aid in the classification of
documents. Topic modeling has been used in the fields of public health to organize information about substance
abuse and depression among teens.11 It has been used to automatically tag webpages.12 It has also been used to
organize documents for more efficient information retrieval.13 In short, topic modeling is a valuable tool for
information professionals who organize information. There is no reason why theological librarians cannot also use
topic modeling to their own advantage. The first step is to learn how topic models work, and the second step is
to think creatively about how topic models can be applied to the professional activities and scholarly pursuits of
theological librarians.
Software
There are a number of software packages available in different programming languages for generating topic models.
MALLET is one popular option; it is written in Java, but there are wrappers available in both Python and R.14 For
those who are more familiar with the programming language R, there are a few different packages available such as
topicmodels15 and lda.16 For those with a preference for the Python programming language, Gensim is an increasingly
popular topic modeling package.17 Gensim is notable for its scalability (it can handle corpora containing tens of
thousands of documents) and its straightforward interface. Any of these software options is sufficient for building a
topic model, but here I have selected Gensim primarily because of the popularity and the ease of use of the Python
language. That said, many of the steps in what follows are described on a conceptual level which will be of use no
matter what software one uses.
10. Suominen and Toivanen, “Map of Science with Topic Modeling.”
11. Wang Shi-Heng et al., “Text Mining for Identifying Topics in the Literatures about Adolescent Substance Use and Depression,” BMC Public Health 16 (2016), https://doi.org/10.1186/s12889-016-2932-1.
12. Maria Lin and David W. Cheung, “An Automatic Approach for Tagging Web Services Using Machine Learning Techniques,” Web Intelligence 14, no. 2 (April 2016): 99–118, https://doi.org/10.3233/WEB-160334.
13. Shoaib Jameel, Wai Lam, and Lidong Bing, “Supervised Topic Models with Word Order Structure for Document Classification and Retrieval Learning,” Information Retrieval Journal 18, no. 4 (August 2015): 283–330, https://doi.org/10.1007/s10791-015-9254-2.
14. For an introduction to MALLET see Shawn Graham, Scott Weingart, and Ian Milligan, “Getting Started with Topic Modeling and MALLET,” Programming Historian, September 2, 2012, https://programminghistorian.org/lessons/topic-modeling-and-mallet.
15. Bettina Grün and Kurt Hornik, “Topicmodels: An R Package for Fitting Topic Models,” Journal of Statistical Software 40, no. 13 (May 1, 2011), https://doaj.org.
16. Jonathan Chang, Lda: Collapsed Gibbs Sampling Methods for Topic Models, version 1.4.2, n.d., https://cran.r-project.org/web/packages/lda/lda.pdf.
17. R. Rehurek and P. Sojka, “Software Framework for Topic Modeling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (Valletta, Malta: ELRA, 2010).
Gensim is the tool that does the computational work of constructing the model; all it needs is a properly prepared
corpus of documents. Gensim operates with three core concepts: corpus, vector, and model.18 The corpus is the
collection of documents from which the model is generated. These documents must be prepared in a specific way to
be of use to Gensim; namely, each document of the corpus must be made into a list of words. Each word in a corpus
can be thought of as a feature of the corpus. If every word in the corpus is a feature, it is clear that each document
will have many, but not all, features available in the corpus. A vector, then, is a representation of each document that
tallies each of the features that document contains. Consider two similar sentences:
(a) Corpora are composed of many documents.
(b) Documents contain many features.
If these two sentences are taken together as a small corpus, we could say that the corpus has eight unique features:
“corpora,” “are,” “composed,” “of,” “many,” “documents,” “contain,” and “features.” We can then represent the
sentences in this way:
      corpora  are  composed  of  many  documents  contain  features
(a)      1      1      1      1    1        1         0         0
(b)      0      0      0      0    1        1         1         1
The rows in the above table (excluding the header) are vector representations of each sentence which tally how many
times each feature appears in each sentence. Finally, in the context of Gensim, a model is a representation of a corpus.
It is a way of referring to the transformation of one document representation to another. As a basic example, the table
above is a model of sentences a and b. Sentences a and b have been transformed from sentences to a table.
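This transformation from sentences to count vectors can be sketched in a few lines of Python. The code below is only an illustration of the concept (the variable and function names are my own), not part of the Gensim workflow itself:

```python
from collections import Counter

# The two example sentences above, lowercased and without punctuation.
docs = [
    "corpora are composed of many documents",
    "documents contain many features",
]

# Build the list of unique features in order of first appearance.
features = []
for doc in docs:
    for word in doc.split():
        if word not in features:
            features.append(word)

def vectorize(doc):
    """Represent a document as a tally of each corpus feature."""
    counts = Counter(doc.split())
    return [counts[f] for f in features]

vectors = [vectorize(doc) for doc in docs]
print(features)
print(vectors[0])  # [1, 1, 1, 1, 1, 1, 0, 0]
print(vectors[1])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Running this reproduces the eight features and the two rows of the table above.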
Step-by-Step Procedure
The actual process of topic modeling involves five major steps: (1) create a document corpus, (2) preprocess the text
to gather the most informative features, (3) process the text into a corpus that can be used by Gensim (that is, turn
each document into a vector), (4) build the topic model, and (5) analyze the topic model.
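As an illustration of step 2, a minimal preprocessing function might lowercase the text, strip punctuation, and drop common stop words. This sketch is only illustrative; the stop word list and length threshold here are my own assumptions, and the preprocessing actually used for this project is in the code on Github:

```python
import re

# A tiny stop word list for illustration; a real project would use a
# fuller list (for example, from NLTK or spaCy).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def preprocess(text):
    """Lowercase, keep only alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

doc = "The library IS a center of theological learning."
print(preprocess(doc))  # ['library', 'center', 'theological', 'learning']
```

A real workflow would typically also filter out very common and very rare words, and might stem or lemmatize tokens.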
18. R. Rehurek, “Gensim: Topic Modelling for Humans,” accessed February 4, 2018, https://radimrehurek.com/gensim/intro.html.
19. For specifics see https://github.com/msaxton/tl_topic_model/blob/master/tl_topic_model.py.
After making the dictionary, Gensim can now create a corpus to train the topic model. The corpus created here is different
from the starting corpus in that now each document is represented by a series of number pairs. The first number in the
pair represents the integer ID assigned to a word type and the second number in the pair represents how often that word
type occurs. A document in this corpus may then look something like [(9, 5), (72, 1), (56, 2), …], where in the first pairing
“9” may refer to the word “library” (for example) and “5” refers to the number of times that word is used in the document.
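A plain-Python sketch may help make this representation concrete. It mimics what Gensim’s Dictionary and doc2bow steps do, though the helper names (id_for, doc2bow) and the sample tokens here are my own illustrations rather than Gensim itself:

```python
from collections import Counter

# Two small documents, already tokenized (the output of preprocessing).
tokenized_docs = [
    ["library", "book", "library", "catalog"],
    ["book", "archive", "catalog"],
]

# Assign each word type an integer ID, as Gensim's Dictionary does.
id_for = {}
for doc in tokenized_docs:
    for word in doc:
        if word not in id_for:
            id_for[word] = len(id_for)

def doc2bow(doc):
    """Return sorted (word_id, count) pairs, like Gensim's doc2bow."""
    counts = Counter(doc)
    return sorted((id_for[w], n) for w, n in counts.items())

print(doc2bow(tokenized_docs[0]))  # [(0, 2), (1, 1), (2, 1)]
```

In the first pair, 0 is the ID assigned to “library” and 2 is the number of times it occurs in the first document.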
20. Blei et al., “Latent Dirichlet Allocation.”
Each of these articles comes from volume 5 issue 1 from a special forum titled “The Reshaping of Libraries.” In this
instance, the topic model successfully identified a meaningful topic and properly associated relevant articles with the
topic.
Topic 0 provides us another example (note: computers generally start a list with 0 not 1):
(0, '0.045*"church" + 0.021*"pope" + 0.013*"christian" + 0.012*"fie" + 0.011*"john" + 0.010*"franci" + 0.009*"middle" + 0.008*"francis" + 0.008*"catechism" + 0.007*"life"')
There are clearly some problems with words in this topic. The word “franci” should be “francis,” and the word “fie” is
likely the result of poor text extraction from the PDF. Nonetheless, this group of words clearly represents a coherent
semantic domain. The topic model associated the following articles with this topic with a 30 percent or higher
probability:
• Keith Edward Lemna, “Pope Francis’ Strong Thought,” Theological Librarianship 7, no. 2 (2014): 45-53.
• Lorraine H. Olley, “Benedict Biscop: Benedictine, Builder, Bibliophile,” Theological Librarianship 7, no. 1
(2014): 30-37.
• Michael R. Mitchell, “Christian Catechetical Texts,” Theological Librarianship 5, no. 2 (2012): 92-94.
• Matthew Baker, “Christian Traditions in the Contemporary Middle East,” Theological Librarianship 4, no. 1
(2011): 68-74.
• Fred Guyette, “The Literature of Ecclesiology: A Ten Year Retrospective,” Theological Librarianship 4, no. 1
(2011): 75-90.
• Katharina Penner, “Information Behaviour(s) of Theologians: A Literature Review,” Theological Librarianship
2, no. 1 (2009): 67-82.
• Fred Guyette, “An Open Access Source for the Study of Religion and the Law: The Proceedings of the Old
Bailey: London’s Central Criminal Court 1674-1913,” Theological Librarianship 1, no. 2 (2008): 28-37.
While not all of these articles have an explicit focus on Pope Francis or Catholicism, they each are related to the topics
of church or ecclesiology in ways that other articles in TL are not.
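The association step can be sketched as a simple filter over document-topic distributions. The titles and probability values below are invented for illustration; in practice they would come from the trained model (for example, via Gensim's get_document_topics):

```python
# Hypothetical (topic_id, probability) pairs for a few documents.
doc_topics = {
    "Pope Francis' Strong Thought": [(0, 0.62), (14, 0.11)],
    "Christian Catechetical Texts": [(0, 0.34), (7, 0.20)],
    "Playing Well With Others": [(36, 0.48), (0, 0.05)],
}

def docs_for_topic(doc_topics, topic_id, threshold=0.30):
    """Return titles whose probability for topic_id meets the threshold."""
    return [
        title
        for title, topics in doc_topics.items()
        if any(t == topic_id and p >= threshold for t, p in topics)
    ]

print(docs_for_topic(doc_topics, 0))
# ["Pope Francis' Strong Thought", 'Christian Catechetical Texts']
```

With the 30 percent threshold used above, the third document is excluded from topic 0 even though the topic appears in its distribution.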
Topic 36 appears to be primarily about Methodism, but “consortium” and a few other terms in this topic do not fit
that theme, so there is room for improvement here:

(36, '0.054*"methodist" + 0.030*"church" + 0.024*"united" + 0.023*"methodism" + 0.018*"consortium" + 0.014*"denomination" + 0.011*"culture" + 0.011*"episcopal" + 0.009*"draw" + 0.009*"century"')
The topic model associated the following articles with this topic:
• Christopher J. Anderson, “We Desire Everything Illustrating the History of Methodism That We Can
Procure: Examining the Methodist Collections at Drew University,” Theological Librarianship 6, no. 1 (2013):
9-15.
• James Wiser, “‘Playing Well With Others:’ New Opportunities for Library Consortia,” Theological
Librarianship 5, no. 2 (2012): 43-47.
• Mark R. Teasdale, “Growth of Declension: Methodist Historians’ Treatment of the Relationship Between the
Methodist Episcopal Church and the Culture of the United States,” Theological Librarianship 3, no. 2 (2010):
34-44.
The three topics above provided examples of coherent semantic domains (even though there is room for improvement
in each). However, not all topics generated by this model are as informative.
Conclusion
By way of conclusion we can ask: How can the process of topic modeling be applied to library-related projects? Alternatively,
why should a librarian bother to learn this skill? Insofar as many library workflows aim at the organization of information,
there is potential for topic modeling to enhance the activities of a librarian. Here are some examples.
First, many academic libraries serve as the repository for theses written by graduate students. Using a topic model trained
on an existing collection of theses would allow a librarian to automate much of the process of cataloging new theses. I am
not suggesting that the entire process of cataloging be automated, but cataloging workflows for unpublished materials like
graduate theses could be made more efficient by mapping a topic model’s topics to a controlled vocabulary used by an
institution. This is especially true if a library has a sizable collection of graduate theses in digital form.
Second, and similarly, as digital archival collections grow, there is a growing need to make these collections more discoverable.
An archivist could train a topic model on an existing digital collection and use that model to group sub-collections and
assign key words to digital objects. The digital archivist could then use the topic model as the basis for suggesting “related
items” to any digital object.
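One plausible way to implement such “related items” suggestions is to compare objects’ topic distributions using cosine similarity. The object names and distributions below are invented for illustration; in practice each vector would come from the trained model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical topic distributions over three topics for three objects.
objects = {
    "letter_1890": [0.7, 0.2, 0.1],
    "sermon_1912": [0.6, 0.3, 0.1],
    "photo_1954": [0.1, 0.1, 0.8],
}

def related_items(name, objects, top_n=2):
    """Rank the other objects by similarity of their topic distributions."""
    scores = [
        (other, cosine(objects[name], vec))
        for other, vec in objects.items() if other != name
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]

print(related_items("letter_1890", objects))
```

Here the sermon, whose topic distribution closely matches the letter’s, would be suggested ahead of the photograph.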
Finally, many libraries collect qualitative survey data. If that data becomes too large to be easily manageable, a topic model
could help librarians break it down into categories that are easy to process. These are just a few examples of how librarians
can apply topic models to their work. These examples are by no means exhaustive, and with a little effort the readers of this
journal could think of many more. In fact, the more that librarians familiarize themselves with skills like topic modeling,
the more creative librarians can get in applying such skill sets.
The goal of this gentle introduction was to expose the readers of Theological Librarianship to topic models as a way to
understand and process large corpora of documents. Toward that end, I discussed topic modeling on a conceptual level,
I pointed out a few applications, and then I provided a step-by-step example of building a topic model of Theological
Librarianship including an analysis of that example. That topic model was somewhat successful in modeling Theological
Librarianship in helpful ways, but I also made a few suggestions whereby that model may be improved. My hope is that this
article inspires a curiosity in its readers to learn more about topic modeling and find creative applications of this process for
their professional workflows and scholarly practices.
Bibliography
Blei, David M., Andrew Y. Ng, Michael I. Jordan, and John Lafferty. “Latent Dirichlet Allocation.” Journal of
Machine Learning Research 3, no. 4/5 (May 15, 2003): 993–1022.
Brett, Megan R. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities, April 8, 2013. http://
journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/.
Chang, Jonathan. Lda: Collapsed Gibbs Sampling Methods for Topic Models (version 1.4.2), n.d. https://cran.r-
project.org/web/packages/lda/lda.pdf.
Graham, Shawn, Scott Weingart, and Ian Milligan. “Getting Started with Topic Modeling and MALLET.”
Programming Historian, September 2, 2012. https://programminghistorian.org/lessons/topic-modeling-and-
mallet.
Grün, Bettina, and Kurt Hornik. “Topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical
Software 40, no. 13 (May 1, 2011). https://doaj.org.
Jameel, Shoaib, Wai Lam, and Lidong Bing. “Supervised Topic Models with Word Order Structure for Document
Classification and Retrieval Learning.” Information Retrieval Journal 18, no. 4 (August 2015): 283–330. https://
doi.org/10.1007/s10791-015-9254-2.
Kavuluru, Ramakanth, and Yuan Lu. “Leveraging Output Term Co-Occurrence Frequencies and Latent
Associations in Predicting Medical Subject Headings.” Data & Knowledge Engineering, Special issue following
the 18th International Conference on Applications of Natural Language Processing to Information Systems
(NLDB’13), 94 (November 1, 2014): 189–201. https://doi.org/10.1016/j.datak.2014.09.002.
Lin, Maria, and David W. Cheung. “An Automatic Approach for Tagging Web Services Using Machine Learning
Techniques.” Web Intelligence (2405-6456) 14, no. 2 (April 2016): 99–118. https://doi.org/10.3233/WEB-
160334.
Rehurek, R. “Gensim: Topic Modelling for Humans.” Accessed February 4, 2018. https://radimrehurek.com/
gensim/intro.html.
Rehurek, R., and P. Sojka. “Software Framework for Topic Modeling with Large Corpora.” In Proceedings of the
LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, 2010.
Alghamdi, Rubayyi, and Khalid Alfalqi. “A Survey of Topic Modeling in Text Mining.” International Journal
of Advanced Computer Science and Applications 6, no. 1 (2015): 147–153. https://doi.org/10.14569/
IJACSA.2015.060121.
Wang, Shi-Heng, Yijun Ding, Weizhong Zhao, Yung-Hsiang Huang, Roger Perkins, Wen Zou, and James Chen.
“Text Mining for Identifying Topics in the Literatures about Adolescent Substance Use and Depression.” BMC
Public Health 16 (2016). https://doi.org/10.1186/s12889-016-2932-1.
Suominen, Arho, and Hannes Toivanen. “Map of Science with Topic Modeling: Comparison of Unsupervised
Learning and Human-Assigned Subject Classification.” Journal of the Association for Information Science &
Technology 67, no. 10 (October 2016): 2464–76. https://doi.org/10.1002/asi.23596.
Upton, Graham, and Ian Cook. “Machine Learning.” In A Dictionary of Statistics. Oxford University Press, 2014.
http://www.oxfordreference.com.du.idm.oclc.org/view/10.1093/acref/9780199679188.001.0001/acref-
9780199679188-e-2380.