Data Mining

Information extraction is the process of extracting structured information from unstructured textual sources, enabling tasks like content classification and data mining. It involves several subtasks, including pre-processing text, finding and classifying concepts, and enriching knowledge bases. Semantically enhanced information extraction adds metadata to extracted concepts, facilitating advanced applications such as knowledge graph creation and improved data retrieval.

Information Extraction

It is the process of extracting information from unstructured textual sources to enable finding
entities as well as classifying and storing them in a database. Semantically enhanced
information extraction (also known as semantic annotation) couples those entities with their
semantic descriptions and connections from a knowledge graph. By adding metadata to the
extracted concepts, this technology solves many challenges in enterprise content
management and knowledge discovery.
Information extraction is the process of extracting specific (pre-specified) information from textual sources. One of the simplest examples is when your email client extracts the details of an appointment from a message so you can add it to your calendar.
Other free-flowing textual sources from which information extraction can distill structured information are legal acts, medical records, social media interactions and streams, online news, government documents, corporate reports and more.
By gathering detailed structured data from texts, information extraction enables:
 the automation of tasks such as smart content classification, integrated search, management and delivery;
 data-driven activities such as mining for patterns and trends, uncovering hidden relationships, etc.

How Does Information Extraction Work?


There are many subtleties and complex techniques involved in the process of information extraction, but a good starting point for a beginner is a minimalist description: turning unstructured text into structured facts.
To elaborate a bit on this minimalist way of describing information extraction, the process involves transforming an unstructured text or a collection of texts into sets of facts (i.e., formal, machine-readable statements of the type "Bukowski is the author of Post Office") that are then populated (filled) into a database (like an American Literature database).

Typically, for structured information to be extracted from unstructured texts, the following main subtasks are involved:

 Pre-processing of the text – this is where the text is prepared for processing with the help of computational linguistics tools such as tokenization, sentence splitting, morphological analysis (morphemes can be further divided into inflectional and derivational morphemes), etc.
 Finding and classifying concepts – this is where mentions of people, things,
locations, events and other pre-specified types of concepts are detected and
classified.
 Connecting the concepts – this is the task of identifying relationships between the
extracted concepts.
 Unifying – this subtask is about presenting the extracted data into a standard form.
 Getting rid of the noise – this subtask involves eliminating duplicate data.
 Enriching your knowledge base – this is where the extracted knowledge is ingested
in your database for further use.
Information extraction can be entirely automated or performed with the help of human
input.

Typically, the best information extraction solutions are a combination of automated methods and human processing.

An Example of Information Extraction


Consider the paragraph below (an excerpt from a news article about the Valencia MotoGP and Marc Marquez):

Marc Marquez was fastest in the final MotoGP warm-up session of the 2016 season at
Valencia, heading Maverick Vinales by just over a tenth of a second.

After qualifying second on Saturday behind a rampant Jorge Lorenzo, Marquez took
charge of the 20-minute session from the start, eventually setting a best time of
1m31.095s at half-distance.

Through information extraction, the following basic facts can be pulled out of the free-
flowing text and organized in a structured, machine-readable form:

Person: Marc Marquez


Location: Valencia
Event: MotoGP
Related mentions: Maverick Vinales, Yamaha, Jorge Lorenzo

Adding Semantics to the Information Extraction Process
While information extraction helps for finding entities, classifying and storing them in a
database, semantically enhanced information extraction couples those entities with their
semantic descriptions and connections from a knowledge graph. The latter is also known
as semantic annotation. Technically, semantic annotation adds metadata to the extracted
concepts, providing both class and instance information about them.

Semantic annotation is applicable for any sort of text – web pages, regular (non-web)
documents, text fields in databases, etc. Further knowledge acquisition can be
performed on the basis of extracting more complex dependencies – analysis of
relationships between entities, event and situation descriptions, etc.

Extending the existing practices of information extraction, semantic information extraction enables new types of applications such as:

 highlighting, indexing and retrieval;
 categorization and generation of more advanced metadata;
 smooth traversal between unstructured text and available relevant knowledge.

Named entity recognition (NER) is one of the most popular data preprocessing tasks. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is basically the thing that is consistently talked about or referred to in the text.
NER is a form of NLP.
At its core, NER is a two-step process; below are the two steps involved:
 Detecting the entities in the text
 Classifying them into different categories
Some of the most important categories in NER are:
 Person
 Organization
 Place/location
Other common tasks include classifying the following:
 Date/time expressions
 Numeral measurements (money, percent, weight, etc.)
 E-mail addresses
Ambiguity in NER
 For a person, the category definition is intuitively quite clear, but for computers there is some ambiguity in classification. Let's look at some ambiguous examples:
 England (Organisation) won the 2019 world cup vs. The 2019 world cup happened in England (Location).
 Washington (Location) is the capital of the US vs. The first president of the US was Washington (Person).
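
To make this concrete, here is a minimal NER sketch with spaCy (assuming the en_core_web_sm model is installed; the exact labels it assigns are model-dependent):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The first president of the US was Washington. Washington is the capital of the US.")
for ent in doc.ents:
    # Print each detected entity span with its predicted category,
    # e.g. US -> GPE; Washington may come out as PERSON or GPE depending on context.
    print(ent.text, ent.label_)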

Relation Extraction (RE) is the task of extracting semantic relationships from text, which usually occur between two or more entities. These relations can be of different types. E.g. "Paris is in France" states an "is in" relationship from Paris to France. This can be denoted using triples, e.g. (Paris, is in, France).

Information Extraction (IE) is the field of extracting structured information from natural language text. This field is used for various NLP tasks, such as creating Knowledge Graphs, Question-Answering Systems, Text Summarization, etc. Relation extraction is in itself a subfield of IE.
There are five different methods of doing Relation Extraction:

1. Rule-based RE
2. Weakly Supervised RE
3. Supervised RE
4. Distantly Supervised RE
5. Unsupervised RE

We will go through all of them at a high level and discuss some pros and cons of each one.

Rule-based RE
Many instances of relations can be identified through hand-crafted patterns, looking for triples (X, α, Y) where X and Y are entities and α is the sequence of words in between. For the "Paris is in France" example, α = "is in". This could be extracted with a regular expression.
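
As a minimal, hedged illustration (assuming entities are single capitalized words; a real system would match over NER and POS tags rather than raw text), such a pattern could be written as:

import re

# Hand-crafted "is in" pattern over raw text, yielding (X, "is in", Y) triples.
pattern = re.compile(r"([A-Z][a-z]+) is in ([A-Z][a-z]+)")

text = "Paris is in France. Berlin is in Germany."
triples = [(x, "is in", y) for x, y in pattern.findall(text)]
print(triples)  # [('Paris', 'is in', 'France'), ('Berlin', 'is in', 'Germany')]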
For reference, these are the Penn Treebank part-of-speech tags commonly used when writing such patterns:
CC: coordinating conjunction
CD: cardinal digit
DT: determiner
EX: existential there ("there is"; think of it like "there exists")
FW: foreign word
IN: preposition/subordinating conjunction
JJ: adjective ('big')
JJR: adjective, comparative ('bigger')
JJS: adjective, superlative ('biggest')
LS: list marker ('1)')
MD: modal ('could', 'will')
NN: noun, singular ('desk')
NNS: noun, plural ('desks')
NNP: proper noun, singular ('Harrison')
NNPS: proper noun, plural ('Americans')
PDT: predeterminer ('all the kids')
POS: possessive ending ("parent's")
PRP: personal pronoun ('I', 'he', 'she')
PRP$: possessive pronoun ('my', 'his', 'hers')
RB: adverb ('very', 'silently')
RBR: adverb, comparative ('better')
RBS: adverb, superlative ('best')
RP: particle ('give up')
TO: to ("go 'to' the store")
UH: interjection ('errrrrrrrm')
VB: verb, base form ('take')
VBD: verb, past tense ('took')
VBG: verb, gerund/present participle ('taking')
VBN: verb, past participle ('taken')
VBP: verb, non-3rd person singular present ('take')
VBZ: verb, 3rd person singular present ('takes')
WDT: wh-determiner ('which')
WP: wh-pronoun ('who', 'what')
WP$: possessive wh-pronoun ('whose')
WRB: wh-adverb ('where', 'when')

Named entities in sentence

Part-of-speech tags in sentence

Only looking at keyword matches will also retrieve many false positives. We can mitigate this by filtering on named entities, only retrieving (CITY, is in, COUNTRY). We can also take into account the part-of-speech (POS) tags to remove additional false positives.

These are examples of word sequence patterns, because the rule specifies a pattern following the order of the text. Unfortunately, these types of rules fall apart for longer-range patterns and sequences with greater variety. E.g. "Fred and Mary got married" cannot successfully be handled by a word sequence pattern.

Dependency paths in sentence

Instead, we can make use of dependency paths in the sentences, knowing which word has a grammatical dependency on which other word. This can greatly increase the coverage of the rule without extra effort.

We can also transform the sentences before applying the rule. E.g. "The cake was baked by Harry" or "The cake which Harry baked" can be transformed into "Harry baked the cake". Then we are changing the order to work with our "linear rule", while also removing redundant modifying words in between.

Pros

 Humans can create patterns which tend to have high precision
 Can be tailored to specific domains

Cons

 Human patterns are still often low-recall (too much variety in languages)
 A lot of manual work to create all possible rules
 Have to create rules for every relation type

Weakly Supervised RE
The idea here is to start out with a set of hand-crafted rules and automatically find new ones from the unlabeled text data through an iterative process (bootstrapping). Alternatively, one can start out with a set of seed tuples describing entities with a specific relation. E.g. seed={(ORG:IBM, LOC:Armonk), (ORG:Microsoft, LOC:Redmond)} states entities having the relation "based in".

Agichtein, Eugene, and Luis Gravano. “Snowball: Extracting relations from large plain-text collections.” Proceedings of
the fifth ACM conference on Digital libraries. ACM, 2000.

Snowball is a fairly old example of an algorithm which does this:

1. Start with a set of seed tuples (or extract a seed set from
the unlabeled text with a few hand-crafted rules).
2. Extract occurrences from the unlabeled text that matches
the tuples and tag them with a NER (named entity
recognizer).
3. Create patterns for these occurrences, e.g. “ORG is based
in LOC”.
4. Generate new tuples from the text, e.g. (ORG:Intel, LOC:
Santa Clara), and add to the seed set.
5. Go to step 2, or terminate and use the patterns that were created for further extraction
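
A toy sketch of this bootstrapping loop (the seed tuples, corpus and helper functions below are illustrative; this is not the original Snowball implementation):

import re

seeds = {("IBM", "Armonk"), ("Microsoft", "Redmond")}
corpus = [
    "IBM is based in Armonk.",
    "Microsoft is based in Redmond.",
    "Intel is based in Santa Clara.",
]

def find_patterns(tuples, sentences):
    # Replace known ORG/LOC pairs with placeholders to induce pattern strings.
    patterns = set()
    for org, loc in tuples:
        for s in sentences:
            if org in s and loc in s:
                patterns.add(s.replace(org, "ORG").replace(loc, "LOC"))
    return patterns

def apply_patterns(patterns, sentences):
    # Turn each pattern back into a regex and collect newly matched tuples.
    new_tuples = set()
    for p in patterns:
        regex = re.escape(p).replace("ORG", r"(?P<org>[A-Z]\w*)").replace("LOC", r"(?P<loc>[A-Z][\w ]*)")
        for s in sentences:
            m = re.match(regex, s)
            if m:
                new_tuples.add((m.group("org"), m.group("loc").strip()))
    return new_tuples

for _ in range(2):  # a couple of bootstrapping iterations
    patterns = find_patterns(seeds, corpus)
    seeds |= apply_patterns(patterns, corpus)

print(seeds)  # now also contains ('Intel', 'Santa Clara')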

Pros
 More relations can be discovered than with Rule-based RE (higher recall)
 Less human effort required (only requires a high-quality seed)

Cons

 The set of patterns becomes more error-prone with each iteration
 Must be careful when generating new patterns through occurrences of tuples; e.g. "IBM shut down an office in Hursley" could easily be caught by mistake when generating patterns for the "based in" relation
 New relation types require new seeds (which have to be
manually provided)

Supervised RE
A common way to do Supervised Relation Extraction is to train a
stacked binary classifier (or a regular binary classifier) to determine if
there is a specific relation between two entities. These classifiers take
features about the text as input, thus requiring the text to be
annotated by other NLP modules first. Typical features are: context
words, part-of-speech tags, dependency path between entities, NER
tags, tokens, proximity distance between words, etc.

We could train and extract by:

1. Manually label the text data according to whether a sentence is relevant or not for a specific relation type. E.g. for the "CEO" relation:
"Apple CEO Steve Jobs said to Bill Gates." is relevant
"Bob, Pie Enthusiast, said to Bill Gates." is not relevant
2. Manually label the relevant sentences as positive/negative
if they are expressing the relation. E.g. “Apple CEO Steve
Jobs said to Bill Gates.”:
(Steve Jobs, CEO, Apple) is positive
(Bill Gates, CEO, Apple) is negative
3. Learn a binary classifier to determine if the sentence is
relevant for the relation type
4. Learn a binary classifier on the relevant sentences to
determine if the sentence expresses the relation or not
5. Use the classifiers to detect relations in new text data.

Some choose not to train a "relevance classifier", and instead let a single binary classifier determine both things in one go.
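
A hedged sketch of this setup with scikit-learn, collapsing it into a single binary classifier over bag-of-words features (the sentences and labels below are toy placeholders; real systems add POS tags, dependency paths, entity types, etc.):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: does the sentence express the "CEO" relation for the marked pair?
sentences = [
    "Apple CEO Steve Jobs said to Bill Gates.",
    "Bob, Pie Enthusiast, said to Bill Gates.",
    "Microsoft CEO Satya Nadella spoke at the event.",
    "Alice met Bob at the conference.",
]
labels = [1, 0, 1, 0]  # 1 = expresses the relation, 0 = does not

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(sentences, labels)

print(clf.predict(["Amazon CEO Andy Jassy announced the results."]))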

Pros

 High-quality supervision (ensuring that the relations that are extracted are relevant)
 We have explicit negative examples

Cons

 Expensive to label examples
 Expensive/difficult to add new relations (need to train a new classifier)
 Does not generalize well to new domains
 Is only feasible for a small set of relation types
Distantly Supervised RE
We can combine the idea of using seed data, as in Weakly Supervised RE, with training a classifier, as in Supervised RE. However, instead of providing a set of seed tuples ourselves, we can take them from an existing Knowledge Base (KB), such as Wikipedia, DBpedia, Wikidata, Freebase or Yago.

Distantly Supervised RE schema

1. For each relation type we are interested in from the KB
2. For each tuple of this relation in the KB
3. Select sentences from our unlabeled text data that match these tuples (both words of the tuple co-occur in the sentence), and assume that these sentences are positive examples for this relation type
4. Extract features from these sentences (e.g. POS, context words, etc.)
5. Train a supervised classifier on this
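
A minimal sketch of the labeling step (the KB tuples and corpus below are made up for illustration; the resulting noisy positives would then feed a supervised classifier as in the previous section):

# Tuples for the "based in" relation taken from some knowledge base.
kb_based_in = [("IBM", "Armonk"), ("Microsoft", "Redmond")]

corpus = [
    "IBM opened a research lab in Armonk.",
    "IBM shut down an office in Hursley.",
    "Microsoft is headquartered in Redmond.",
]

# Distant supervision: any sentence containing both entities of a KB tuple
# is (noisily) assumed to be a positive example of the relation.
positives = []
for sentence in corpus:
    for e1, e2 in kb_based_in:
        if e1 in sentence and e2 in sentence:
            positives.append((sentence, (e1, "based_in", e2)))

for sentence, triple in positives:
    print(triple, "<-", sentence)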

Pros

 Less manual effort
 Can scale to use large amounts of labeled data and many relations
 No iterations required (compared to Weakly Supervised RE)

Cons

 Noisy annotation of the training corpus (sentences that contain both words in the tuple may not actually describe the relation)
 There are no explicit negative examples (this can be tackled by matching unrelated entities)
 Is restricted to the Knowledge Base
 May require careful tuning to the task

Unsupervised RE
Here we extract relations from text without having to label any training data, provide a set of seed tuples, or write rules to capture different types of relations in the text. Instead we rely on a set of very general constraints and heuristics. It could be argued whether this is truly unsupervised, since we are using "rules" which are simply at a more general level, and in some cases small sets of labeled text data are still used to design and tweak the systems. Nevertheless, these systems tend to require less supervision in general. Open Information Extraction (Open IE) generally refers to this paradigm.
TextRunner algorithm. Bach, Nguyen, and Sameer Badaskar. “A review of relation extraction.” Literature review for
Language and Statistics II 2 (2007).

TextRunner is an algorithm which belongs to this kind of RE solution. Its algorithm can be described as follows:

1. Train a self-supervised classifier on a small corpus
 For each parsed sentence, find all pairs of noun phrases (X, Y) with a sequence of words r connecting them. Label them as positive examples if they meet all of the constraints, otherwise label them as negative examples.
 Map each triple (X, r, Y) to a feature vector representation (e.g. incorporating POS tags, number of stop words in r, NER tags, etc.)
 Train a binary classifier to identify trustworthy candidates

2. Pass over the entire corpus and extract possible relations
 Fetch potential relations from the corpus
 Keep/discard candidates according to whether the classifier considers them trustworthy or not

3. Rank-based assessment of relations based on text redundancy
 Normalize (omit non-essential modifiers) and merge relations that are the same
 Count the number of distinct sentences the relations are present in and assign probabilities to each relation

OpenIE 5.0 and Stanford OpenIE are two open-source systems that do this. They are more modern than TextRunner (which was just used here to demonstrate the paradigm). We can expect a lot of different relationship types as output from systems like these (since we do not specify what kinds of relations we are interested in).
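
As a rough, heuristic sketch of the Open IE idea (using spaCy dependency parses to pull out subject-verb-object candidates; this is an approximation for illustration, not TextRunner or OpenIE 5.0 themselves, and it assumes en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Harry baked the cake. Bell makes electronic products.")

triples = []
for token in doc:
    if token.pos_ == "VERB":
        # Take the first nominal subject and direct object of each verb, if any.
        subjects = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        objects = [w for w in token.rights if w.dep_ in ("dobj", "obj", "attr")]
        if subjects and objects:
            triples.append((subjects[0].text, token.lemma_, objects[0].text))

print(triples)  # e.g. [('Harry', 'bake', 'cake'), ('Bell', 'make', 'products')]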

Pros

 No/almost no labeled training data required
 Does not require us to manually pre-specify each relation of interest; instead it considers all possible relation types

Cons
 Performance of the system depends a lot on how well constructed the constraints and heuristics are
 Relations are not as normalized as pre-specified relation types

Text summarization is the process of generating a short, fluent, and, most importantly, accurate summary of a longer text document. The main idea behind automatic text summarization is to be able to find a short subset of the most essential information from the entire set and present it in a human-readable format. As online textual data grows, automatic text summarization methods have the potential to be very helpful because more useful information can be read in a short time.

Why automatic text summarization?

1. Summaries reduce reading time.


2. When researching documents, summaries make the
selection process easier.
3. Automatic summarization improves the effectiveness of
indexing.
4. Automatic summarization algorithms are less biased than
human summarization.
5. Personalized summaries are useful in question-answering
systems as they provide personalized information.
6. Using automatic or semi-automatic summarization systems
enables commercial abstract services to increase the
number of text documents they are able to process.
Type of summarization:

Based on input type:

1. Single Document, where the input length is short. Many of the early summarization systems dealt with single-document summarization.
2. Multi-Document, where the input can be arbitrarily long.

Based on the purpose:

1. Generic, where the model makes no assumptions about the domain or content of the text to be summarized and treats all inputs as homogeneous. The majority of the work that has been done revolves around generic summarization.
2. Domain-specific, where the model uses domain-specific
knowledge to form a more accurate summary. For
example, summarizing research papers of a specific
domain, biomedical documents, etc.
3. Query-based, where the summary only contains
information that answers natural language questions about
the input text.
Based on output type:

1. Extractive, where important sentences are selected from the input text to form a summary. Most summarization approaches today are extractive in nature.
2. Abstractive, where the model forms its own phrases and
sentences to offer a more coherent summary, like what a
human would generate. This approach is definitely more
appealing, but much more difficult than extractive
summarization.

How to do text summarization

 Text cleaning
 Sentence tokenization
 Word tokenization
 Word-frequency table
 Summarization

Text cleaning:
# !pip install -U spacy
# !python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

stopwords = list(STOP_WORDS)
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

Word tokenization (and word-frequency counting):
tokens = [token.text for token in doc]
print(tokens)

punctuation = punctuation + '\n'

word_frequencies = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
print(word_frequencies)

Sentence tokenization (with normalized word frequencies):
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency
print(word_frequencies)

sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

Word frequency table (scoring each sentence by its word frequencies):
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
print(sentence_scores)

Summarization:
from heapq import nlargest

select_length = int(len(sentence_tokens) * 0.3)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)

Input:
text = """Maria Sharapova has basically no friends as tennis players on the
WTA Tour. The Russian player has no problems in openly speaking about it and
in a recent interview she said: ‘I don’t really hide any feelings too much.I
think everyone knows this is my job here. When I’m on the courts or when I’m
on the court playing, I’m a competitor and I want to beat every single person
whether they’re in the locker room or across the net.So I’m not the one to
strike up a conversation about the weather and know that in the next few
minutes I have to go and try to win a tennis match.I’m a pretty competitive
girl. I say my hellos, but I’m not sending any players flowers as well. Uhm,
I’m not really friendly or close to many players.I have not a lot of friends
away from the courts.’ When she said she is not really close to a lot of
players, is that something strategic that she is doing? Is it different on
the men’s tour than the women’s tour? ‘No, not at all.I think just because
you’re in the same sport doesn’t mean that you have to be friends with
everyone just because you’re categorized, you’re a tennis player, so you’re
going to get along with tennis players.I think every person has different
interests. I have friends that have completely different jobs and interests,
and I’ve met them in very different parts of my life.I think everyone just
thinks because we’re tennis players we should be the greatest of friends. But
ultimately tennis is just a very small part of what we do.There are so many
other things that we’re interested in, that we do.'"""

Output (final summary):
I think just because you’re in the same sport doesn’t mean that you have to
be friends with everyone just because you’re categorized, you’re a tennis
player, so you’re going to get along with tennis players. Maria Sharapova has
basically no friends as tennis players on the WTA Tour. I have friends that
have completely different jobs and interests, and I’ve met them in very
different parts of my life. I think everyone just thinks because we’re tennis
players So I’m not the one to strike up a conversation about the weather and
know that in the next few minutes I have to go and try to win a tennis match.
When she said she is not really close to a lot of players, is that something
strategic that she is doing?
Topic Representation Approaches

Topic words
This common technique aims to identify words that describe the topic of the input document. An advance on Luhn's initial idea was to use the log-likelihood ratio test to identify explanatory words known as the "topic signature". Generally speaking, there are two ways to compute the importance of a sentence: as a function of the number of topic signatures it contains, or as the proportion of topic signatures in the sentence. While the first method gives higher scores to longer sentences with more words, the second one measures the density of the topic words.

Frequency-driven approaches
This approach uses frequency of words as indicators of importance.
The two most common techniques in this category are: word
probability and TFIDF (Term Frequency Inverse Document
Frequency). The probability of a word w is determined as the number
of occurrences of the word, f (w), divided by the number of all words in
the input (which can be a single document or multiple documents).
Words with highest probability are assumed to represent the topic of
the document and are included in the summary. TFIDF, a more
sophisticated technique, assesses the importance of words and
identifies very common words (that should be omitted from
consideration) in the document(s) by giving low weights to words
appearing in most documents. TFIDF has given way to centroid-based
approaches that rank sentences by computing their salience using a
set of features. After creation of TFIDF vector representations of
documents, the documents that describe the same topic are clustered
together and centroids are computed — pseudo-documents that
consist of the words whose TFIDF scores are higher than a certain
threshold and form the cluster. Afterwards, the centroids are used to
identify sentences in each cluster that are central to the topic.
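
As a small illustration of the TF-IDF weighting itself (using scikit-learn; the toy documents are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # shape: (n_documents, n_terms)

print(vectorizer.get_feature_names_out())   # vocabulary; words in many documents get low weights
print(tfidf.toarray().round(2))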

Latent Semantic Analysis


Latent semantic analysis (LSA) is an unsupervised method for
extracting a representation of text semantics based on observed
words. The first step is to build a term-sentence matrix, where each
row corresponds to a word from the input (n words) and each column
corresponds to a sentence. Each entry of the matrix is the weight of
the word i in sentence j computed by TFIDF technique. Then singular
value decomposition (SVD) is used on the matrix that transforms the
initial matrix into three matrices: a term-topic matrix having weights
of words, a diagonal matrix where each row corresponds to the weight
of a topic, and a topic-sentence matrix. If you multiply the diagonal matrix of weights with the topic-sentence matrix, the result describes how much a sentence represents a topic; in other words, the weight of topic i in sentence j.
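
A minimal LSA sketch along these lines with scikit-learn (note the matrix here is sentence-by-term, i.e. the transpose of the term-sentence matrix described above, so the SVD output is directly a sentence-by-topic weight matrix; the sentences are toy examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "The match was won in the final set.",
    "The player served well throughout the match.",
    "The new phone has a larger screen.",
    "Battery life on the phone is excellent.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)   # TF-IDF weights per sentence
svd = TruncatedSVD(n_components=2, random_state=0)
topic_weights = svd.fit_transform(tfidf)             # weight of each topic in each sentence

print(topic_weights.round(2))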

Discourse Based Method


A logical development of analyzing semantics is to perform discourse analysis, finding the semantic relations between textual units to form a summary. The study of cross-document relations was initiated by Radev, who came up with the Cross-Document Structure Theory (CST) model. In his model, words, phrases or sentences can be linked with each other if they are semantically connected. CST was indeed useful for document summarization to determine sentence relevance as well as to treat repetition, complementarity and inconsistency among diverse data sources. Nonetheless, the significant limitation of this method is that the CST relations must be explicitly determined by humans.

Bayesian Topic Models


While other approaches do not have very clear probabilistic interpretations, Bayesian topic models are probabilistic models that, by describing topics in more detail, can represent the information that is lost in other approaches. In topic modeling of text documents, the goal is to infer the words related to a certain topic and the topics discussed in a certain document, based on prior analysis of a corpus of documents. This is possible with the help of Bayesian inference, which calculates the probability of an event based on a combination of common-sense assumptions and the outcomes of previous related events. The model is constantly improved by going through many iterations where a prior probability is updated with observational evidence to produce a new posterior probability.
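
A hedged sketch of this kind of model using scikit-learn's LDA implementation on toy documents (the topics it recovers depend on the corpus and the random seed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the striker scored a late goal in the match",
    "the keeper saved a penalty during the match",
    "the bank raised interest rates this quarter",
    "investors reacted to the interest rate decision",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[-4:]]   # highest-weight words per topic
    print(f"topic {i}:", top_words)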

Indicator representation approaches


The second large group of techniques aims to represent the text based
on a set of features and use them to directly rank the sentences
without representing the topics of the input text.

Graph Methods
Influenced by PageRank algorithm, these methods represent
documents as a connected graph, where sentences form the vertices
and edges between the sentences indicate how similar the two
sentences are. The similarity of two sentences is measured with the
help of cosine similarity with TFIDF weights for words and if it is
greater than a certain threshold, these sentences are connected. This
graph representation results in two outcomes: the sub-graphs included
in the graph create topics covered in the documents, and the
important sentences are identified. Sentences that are connected to many other sentences in a sub-graph are likely to be the center of the graph and will be included in the summary. Since this method does not need language-specific linguistic processing, it can be applied to various languages [43]. At the same time, measuring only the formal side of the sentence structure, without the syntactic and semantic information, limits the application of the method.
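
A TextRank-style sketch of this idea (it keeps a fully weighted similarity graph rather than thresholding edges as described above, and assumes networkx and scikit-learn are available):

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Marc Marquez was fastest in the final warm-up session.",
    "Marquez set his best time at half-distance.",
    "Rain is expected later in the afternoon.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)                 # sentence-to-sentence similarity matrix

graph = nx.from_numpy_array(similarity)               # sentences as vertices, similarities as edge weights
scores = nx.pagerank(graph)                           # centrality score per sentence
ranked = sorted(scores, key=scores.get, reverse=True)

print([sentences[i] for i in ranked[:1]])             # most central sentence as a one-line summary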

Machine Learning
Machine learning approaches that treat summarization as a classification problem are now widely used, applying Naive Bayes, decision trees, support vector machines, Hidden Markov models and Conditional Random Fields to obtain a true-to-life summary. As it has turned out, the methods that explicitly model the dependency between sentences (Hidden Markov models and Conditional Random Fields) often outperform other techniques.

Figure 1: Summary Extraction Markov Model to Extract 2 Lead Sentences and Additional Supporting Sentences
Figure 2: Summary Extraction Markov Model to Extract 3 Sentences

Yet, the problem with classifiers is that if we utilize supervised learning methods for summarization, we need a set of labeled documents to train the classifier, meaning development of a corpus. A possible way out is to apply semi-supervised approaches that combine a small amount of labeled data with a large amount of unlabeled data in training.

Overall, machine learning methods have proved to be very effective and successful both in single- and multi-document summarization, especially in class-specific summarization such as drawing up scientific paper abstracts or biographical summaries.

Though abundant, all the summarization methods we have mentioned could not produce summaries similar to human-created summaries. In many cases, the soundness and readability of the created summaries are not satisfactory, because they fail to cover all the semantically relevant aspects of the data in an effective way and then fail to connect sentences in a natural way.

Chapter 2
A typical approach to relation extraction is to treat the task as a classification
problem [38, 71, 37, 18, 19]. Specifically, any pair of entities
co-occurring in the same sentence is considered a candidate relation instance.
The goal is to assign a class label to this instance where the class
label is either one of the predefined relation types or nil for unrelated
entity pairs. Alternatively, a two-stage classification can be performed
where at the first stage whether two entities are related is determined
and at the second stage the relation type for each related entity pair is
determined.
The classification approach assumes that a training corpus exists in which
all relation mentions for each predefined relation type have been manually
annotated. These relation mentions are used as positive training
examples. Entity pairs co-occurring in the same sentence but not labeled
are used as negative training examples. Each candidate relation instance
is represented by a set of features that are carefully chosen. Standard
learning algorithms such as support vector machines and logistic regression
can then be used to train relation classifiers.
Feature engineering is a critical step for this classification approach.
Researchers have examined a wide range of lexical, syntactic and semantic
features. We summarize some of the most commonly used features
as follows:
Entity features: Oftentimes the two argument entities, including
the entity words themselves and the entity types, are correlated
with certain relation types. In the ACE data sets, for example,
entity words such as father, mother, brother and sister and the
person entity type are all strong indicators of the family relation
subtype.
Lexical contextual features: Intuitively the contexts surrounding
the two argument entities are important. The simplest way to
incorporate evidence from contexts is to use lexical features. For
example, if the word founded occurs between the two arguments,
they are more likely to have the FounderOf relation.
Syntactic contextual features: Syntactic relations between the
two arguments or between an argument and another word can often
be useful. For example, if the first argument is the subject of the
verb founded and the second argument is the object of the verb
founded, then one can almost immediately tell that the FounderOf
relation exists between the two arguments. Syntactic features can
be derived from parse trees of the sentence containing the relation
instance.
Background knowledge: Chan and Roth studied the use of
background knowledge for relation extraction [18]. An example is
to make use of Wikipedia. If two arguments co-occur in the same
Wikipedia article, the content of the article can be used to check
whether the two entities are related. Another example is word
clusters. For example, if we can group all names of companies such
as IBM and Apple into the same word cluster, we achieve a level
of abstraction higher than words and lower than the general entity
type organization. This level of abstraction may help extraction
of certain relation types such as Acquire between two companies.
Jiang and Zhai proposed a framework to organize the features used
for relation extraction such that a systematic exploration of the feature
space can be conducted [37]. Specifically, a relation instance is represented
as a labeled, directed graph G = (V,E, A,B), where V is the set
of nodes in the graph, E is the set of directed edges in the graph, and
A and B are functions that assign labels to the nodes.
First, for each node v ∈ V , A(v) = {a1, a2, . . . , a|A(v)|} is a set of attributes
associated with node v, where ai ∈ Σ, and Σ is an alphabet that
contains all possible attribute values. For example, if node v represents
a token, then A(v) can include the token itself, its morphological base
form, its part-of-speech tag, etc. If v also happens to be the head word
of arg1 or arg2, then A(v) can also include the entity type. Next, function
B : V → {0, 1, 2, 3} is introduced to distinguish argument nodes
from non-argument nodes. For each node v ∈ V , B(v) indicates how
node v is related to arg1 and arg2. 0 indicates that v does not cover any
argument, 1 or 2 indicates that v covers arg1 or arg2, respectively, and 3
indicates that v covers both arguments. In a constituency parse tree, a node v may represent a
phrase and it can possibly cover both arguments.
Figures 2.4, 2.5 and 2.6 show three relation instance graphs based on the
token sequence, the constituency parse tree and the dependency parse
tree, respectively.
Figure: An example sequence representation. The subgraph on the left represents a bigram feature; the subgraph on the right represents a unigram feature that states the entity type of arg2.

It can be shown that many features that have been explored in previous
work on relation extraction can be transformed into this graphic
representation. Figures 2.4, 2.5 and 2.6 show some examples.
This framework allows a systematic exploration of the feature space
for relation extraction. To explore the feature space, Jiang and Zhai considered
three levels of small unit features in increasing order of their complexity:
unigram features, bigram features and trigram features. They
found that a combination of features at different levels of complexity
and from different sentence representations, coupled with task-oriented
feature pruning, gave the best performance.

Text clustering. Source: Kunwar 2013.

The amount of text data being generated in recent years has grown exponentially. It's essential for organizations to have a structure in place to mine actionable insights from the text being generated. From social media analytics to risk management and cybercrime protection, dealing with textual data has never been more important.

Text clustering is the task of grouping a set of unlabelled texts in such a way that texts in the same cluster are
more similar to each other than to those in other clusters. Text clustering algorithms process text and determine
if natural clusters (groups) exist in the data.

Discussion

 What's the principle behind text clustering?

Semantically similar sentences. Source: Yang and Tar 2018.

The big idea is that documents can be represented numerically as vectors of features. The similarity in
text can be compared by measuring the distance between these feature vectors. Objects that are near
each other should belong to the same cluster. Objects that are far from each other should belong to
different clusters.

Essentially, text clustering involves three aspects:

o Selecting a suitable distance measure to identify the proximity of two feature vectors.

o A criterion function that tells us that we've got the best possible clusters and stop further processing.

o An algorithm to optimize the criterion function. A greedy algorithm will start with some initial clustering
and refine the clusters iteratively.

 What are the use cases of text clustering?

Applications of text clustering. Source: Nabi 2018.

We note a few use cases:

o Document Retrieval: To improve recall, start by adding other documents from the same cluster.

o Taxonomy Generation: Automatically generate hierarchical taxonomies for browsing content.

o Fake News Identification: Detect if a news story is genuine or fake.

o Language Translation: Translation of a sentence from one language to another.

o Spam Mail Filtering: Detect unsolicited and unwanted email/messages.

o Customer Support Issue Analysis: Identify commonly reported support issues.


 How is text clustering different from text classification?

Clustering is unsupervised whereas classification is supervised. Source: Valcheva 2018.

Classification is a supervised learning approach that maps an input to an output based on example input-output pairs. Clustering is an unsupervised learning approach.

o Classification: If the prediction value tends to be category like yes/no or positive/negative, then it falls
under classification type problem in machine learning. The different classes are known in advance. For
example, given a sentence, predict whether it's a negative or positive review.

o Clustering: Clustering is the task of partitioning the dataset into groups called clusters. The goal is to
split up the data in such a way that points within single cluster are very similar and points in different
clusters are different. It determines grouping among unlabelled data.

 What are the types of clustering?


Hard versus soft clustering. Source: Withanawasam 2015.

Broadly, clustering can be divided into two groups:

o Hard Clustering: This groups items such that each item is assigned to only one cluster. For example, we
want to know if a tweet is expressing a positive or negative sentiment. k-means is a hard clustering
algorithm.

o Soft Clustering: Sometimes we don't need a binary answer. Soft clustering is about grouping items such
that an item can belong to multiple clusters. Fuzzy C Means (FCM) is a soft clustering algorithm.

 What are the steps involved in text clustering?

Any text clustering approach involves broadly the following steps:

o Text pre-processing: Text can be noisy, hiding information between stop words, inflexions and sparse
representations. Pre-processing makes the dataset easier to work with.

o Feature Extraction: One of the commonly used techniques for extracting features from textual data is calculating the frequency of words/tokens in the document/corpus.

o Clustering: We can then cluster different text documents based on the features we have generated.

 What are the steps involved in text pre-processing?

Below are the main components involved in pre-processing.

o Tokenization: Tokenization is the process of parsing text data into smaller units (tokens) such as words
and phrases.

o Transformation: It converts the text to lowercase, removes all diacritics/accents in the text, and parses
html tags.

o Normalization: Text normalization is the process of transforming a text into a canonical (root) form.
Stemming and lemmatization techniques are used for deriving the root word.

o Filtering: Stop words are common words used in a language, such as 'the', 'a', 'on', 'is', or 'all'. These
words do not carry important meaning for text clustering and are usually removed from texts.

 What are the levels of text clustering?

Text clustering can be document level, sentence level or word level.

o Document level: It serves to regroup documents about the same topic. Document clustering has
applications in news articles, emails, search engines, etc.
o Sentence level: It's used to cluster sentences derived from different documents. Tweet analysis is an
example.

o Word level: Word clusters are groups of words based on a common theme. The easiest way to build a
cluster is by collecting synonyms for a particular word. For example, WordNet is a lexical database for
the English language that groups English words into sets of synonyms called synsets.

 How do I define or extract textual features for clustering?

BOW with word as feature. Source: Hoonlor 2011, fig. 2.1.

In general, words can be used to represent a common class of feature. Word characteristics are also
features. For example, capitalization matters: US versus us, White House versus white house. Part of
speech and grammatical structure also add to textual features. Semantics can be a textual feature: buy
versus purchase.

The mapping from textual data to real-valued vectors is called feature extraction. One of the simplest
techniques to numerically represent text is Bag of Words (BOW). In BOW, we make a list of unique
words in the text corpus called vocabulary. Then we can represent each sentence or document as a
vector, with each word represented as 1 for presence and 0 for absence.
Another representation is to count the number of times each word appears in a document. The most
popular approach is using the Term Frequency-Inverse Document Frequency (TF-IDF) technique.

More recently, word embeddings are being used to map words into feature vectors. A popular model for
word embeddings is word2vec.
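
A tiny sketch of the presence/absence bag-of-words representation with scikit-learn (binary=True gives 1/0 vectors; dropping it would give raw counts instead):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love pizza", "I love chocolates"]

vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())   # the vocabulary
print(bow.toarray())                        # one presence/absence row per sentence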

 How can I measure similarity in text clustering?

Words can be similar lexically or semantically:

o Lexical similarity: Words are similar lexically if they have a similar character sequence. Lexical similarity
can be measured using string-based algorithms that operate on string sequences and character
composition.

o Semantic similarity: Words are similar semantically if they have the same meaning, are opposite of each
other, used in the same way, used in the same context or one is a type of another. Semantic similarity
can be measured using corpus-based or knowledge-based algorithms.

Some of the metrics for computing similarity between two pieces of text are Jaccard coefficient, cosine
similarity and Euclidean distance.
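
A quick sketch of two of these metrics on a toy pair of sentences (cosine similarity over TF-IDF vectors via scikit-learn, and the Jaccard coefficient over plain token sets):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "the cat sat on the mat"
b = "a cat was sitting on the mat"

# Cosine similarity between the two TF-IDF vectors.
vectors = TfidfVectorizer().fit_transform([a, b])
print("cosine:", cosine_similarity(vectors)[0, 1])

# Jaccard coefficient: overlap of token sets divided by their union.
tokens_a, tokens_b = set(a.split()), set(b.split())
print("jaccard:", len(tokens_a & tokens_b) / len(tokens_a | tokens_b))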
 Which are some common text clustering algorithms?

Some types of text clustering algorithms. Source: Khosla et al. 2019, fig. 4.

Ignoring neural network models, we can identify different types:

o Hierarchical: In the divisive approach, we start with one cluster and split that into sub-clusters. Example
algorithms include DIANA and MONA. In the agglomerative approach, each document starts as its own
cluster and then we merge similar ones into bigger clusters. Examples include BIRCH and CURE.

o Partitioning: k-means is a popular algorithm but requires the right choice of k. Other examples are
ISODATA and PAM.

o Density: Instead of using a distance measure, we form clusters based on how many data points fall
within a given radius. DBSCAN is the most well-known algorithm.

o Graph: Some algorithms have made use of knowledge graphs to assess document similarity. This
addresses the problem of polysemy (ambiguity) and synonymy (similar meaning).
o Probabilistic: A cluster of words belong to a topic and the task is to identify these topics. Words also
have probabilities that they belong to a topic. Topic Modelling is a separate NLP task but it's similar to
soft clustering. pLSA and LDA are example topic models.

 How can I evaluate the efficiency of a text clustering algorithm?

Internal quality measure: more compact clusters on the left. Source: Hassani and Seidl 2016.

Measuring the quality of a clustering algorithm has proven to be as important as the algorithm itself. We can evaluate it in two ways:

o External quality measure: External knowledge is required for measuring the external quality. For
example, we can conduct surveys of users of the application that includes text clustering.

o Internal quality measure: The evaluation of the clustering is compared only with the result itself, that is,
the structure of found clusters and their relations to one another. Two main concepts are compactness
and separation. Compactness measures how closely data points are grouped in a
cluster. Separation measures how different the found clusters are from each other. More formally,
compactness is intra-cluster variance whereas separation is inter-cluster distance.
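
One internal measure used in practice (not discussed above, added here as an example) is the silhouette score, which combines compactness and separation into a single number; a hedged sketch with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [
    "the match ended in a draw",
    "the team won the final match",
    "interest rates rose again",
    "the bank cut its rates",
]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))   # closer to 1 means compact, well-separated clusters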

 What are the common challenges involved in text clustering?

Document clustering has been studied for many decades. It's far from trivial or a solved problem. The challenges include the following:

o Selecting appropriate features of documents that should be used for clustering.

o Selecting an appropriate similarity measure between documents.

o Selecting an appropriate clustering method utilising the above similarity measure.

o Implementing the clustering algorithm in an efficient way that makes it feasible in terms of memory
and CPU resources.

o Finding ways of assessing the quality of the performed clustering.


Milestones
1971

Vector space model. Source: Perone 2013.

Text mining research in general relies on a vector space model. Salton first proposes it to model text documents
as vectors. Features are considered to be the words in the document collection and feature values come from
different term weighting schemes, the most popular of which is the Term Frequency-Inverse Document
Frequency (TF-IDF).
1983

Massart et al. in the book The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis introduce various clustering methods, including hierarchical and non-hierarchical methods. They show how clustering can be used to interpret large quantities of analytical data. They discuss how clustering is related to other pattern recognition techniques.
1992

Cutting et al. adapt partition-based clustering algorithms to cluster documents. Two of the techniques are
Buckshot and Fractionation. Buckshot selects a small sample of documents to pre-cluster them using a standard
clustering algorithm and assigns the rest of the documents to the clusters formed. Fractionation finds k centres
by initially breaking N documents into N/m buckets of a fixed size m > k. Each cluster is then treated as if it's
an individual document and the whole process is repeated until there are only K clusters.
1997
Huang introduces k-modes, an extension to the well-known k-means algorithm for clustering numerical data.
By defining the mode notion for categorical clusters and introducing an incremental update rule for cluster
modes, the algorithm preserves the scaling properties of k-means. Naturally, it also inherits its disadvantages,
such as dependence on the seed clusters and the inability to automatically detect the number of clusters.
2008

Sun et al. develop a novel hierarchical algorithm for document clustering. They use the cluster overlapping phenomenon to design cluster merging criteria. The system computes the overlap rate in order to improve time efficiency.

To cluster, or not to cluster


Clustering is one of the biggest topics in data science, so big that you
will easily find tons of books discussing every last bit of it. The
subtopic of text clustering is no exception. This article can therefore
not deliver an exhaustive overview, but it covers the main aspects.
This being said, let us start by getting on common ground what
clustering is and what it isn’t.

You just scrolled by clusters!

In fact, clusters are nothing more than groups that contain similar objects. Clustering is the process used for separating the objects into these groups.

Objects inside of a cluster should be as similar as possible. Objects in different clusters should be as dissimilar as possible. But who defines what "similar" means? We'll come back to that at a later point.

Now, you may have heard of classification before. When classifying objects, you also put them into different groups, but there are a few important differences. Classifying means putting new, previously unseen objects into groups based on objects whose group affiliation is already known, so-called training data. This means we have something reliable to compare new objects to; when clustering, we start with a blank canvas: all objects are new! Because of that, we call classification a supervised method and clustering an unsupervised one.

This also means that for classification the correct number of groups is known, whereas in clustering there is no such number. Note that it is not just unknown: it simply does not exist. It is up to us to choose a suitable number of clusters for our purpose. Many times, this means trying out a few and then choosing the one which delivered the best results.
Kinds of clustering methods
Before we dive right into concrete clustering algorithms, let us first
establish some ways in which we can describe and distinguish them.
There are a few ways in which this is possible:

In hard clustering, every object belongs to exactly one cluster. In soft clustering, an object can belong to one or more clusters. The membership can be partial, meaning the objects may belong to certain clusters more than to others.

In hierarchical clustering, clusters are iteratively combined in a hierarchical manner, finally ending up in one root (or super-cluster, if you will). You can also look at a hierarchical clustering as a binary tree. All clustering methods not following this principle can simply be described as flat clustering, but are sometimes also called non-hierarchical or partitional. You can always convert a hierarchical clustering into a flat one by "cutting" the tree horizontally on a level of your choice.
The tree-shaped diagram of a hierarchical clustering is called a dendrogram.
Objects connected on a lower hierarchy are more similar than objects
connected high up the tree.

Hierarchical methods can be further divided into two subcategories. Agglomerative ("bottom up") methods start by putting each object into its own cluster and then keep unifying them. Divisive ("top down") methods do the opposite: they start from the root and keep dividing it until only single objects are left.

The clustering process


It should be clear what the clustering process looks like, right? You take some data, apply the clustering algorithm of your choice and ta-da, you are done! While this might theoretically be possible, it is usually not the case. Especially when working with text, there are several steps you have to take prior to and after clustering. In reality, the process of clustering text is often messy and marked by many unsuccessful trials. However, if you tried to draw it in an idealized, linear manner, it might look like this:
Quite a few extra steps, right? Don’t worry — you would probably have
intuitively done it right anyway. However, it is helpful to consider each
step on its own and keep in mind that alternative options for solving
the problem might exist.

Clustering words
Congratulations! You have made it past the introduction. In the next
few paragraphs, we will look at clustering methods for words. Let’s
look at the following set of them:

To us it immediately becomes apparent which words belong together. There should obviously be one cluster with animals containing the words Aardvark and Zebra, and one with prepositions containing on and under. But is it equally obvious for a computer?

When talking about words with similar meaning, you often read about
the distributional hypothesis in linguistics. This hypothesis states
that words bearing a similar meaning will appear between similar
word contexts. You could say “The box is on the shelf.”, but also “The
box is under the shelf.” and still produce a meaningful
sentence. On and under are interchangeable up to a certain extent.
This hypothesis is utilized when creating word embeddings. Word embeddings map each word of a vocabulary onto an n-dimensional vector space. Words that have similar contexts will appear roughly in the same area of the vector space. One of these embeddings was developed by Weston, Ratle & Collobert in 2008. You can see an interesting segment of the word vectors (reduced to two dimensions with t-SNE) here:

Source: Joseph Turian

Notice how neatly months, names and locations are grouped together.
This will come in handy for clustering them in the next step. To learn
more about how exactly word embeddings are created and the
interesting properties they have, take a look at this Medium article by
Hunter Heidenreich. It also includes information about more advanced word embeddings like word2vec.
k-means
We will now look at the most famous vector-based clustering algorithm out there: k-means. What k-means does is return a cluster assignment to one of k possible clusters for each object. To recapitulate what we learned earlier, it is a hard, flat clustering method. Let's see what the k-means process looks like:

Source: Chire, via Wikipedia (CC-BY-SA)

K-means assigns k random points in the vector space as the initial, virtual means of the k clusters. It then assigns each data point to the nearest cluster mean. Next, the actual mean of each cluster is recalculated. Based on the shift of the means, the data points are reassigned. This process repeats itself until the means of the clusters stop moving around.
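
A small sketch of k-means over word vectors (assuming spaCy's en_core_web_md model, which ships with word vectors, is installed; the word list is loosely based on the toy example above):

import numpy as np
import spacy
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_md")
words = ["aardvark", "zebra", "on", "under", "january", "march"]
vectors = np.array([nlp(word).vector for word in words])

# Assign each word to one of k = 3 clusters based on its embedding.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(label, word)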

To get a more intuitive and visual understanding of what k-means does, watch this short video by Josh Starmer.

K-means is not the only vector-based clustering method out there. Other often-used methods include DBSCAN, a method favoring densely populated clusters, and expectation maximization (EM), a method that assumes an underlying probabilistic distribution for each cluster.

Brown clustering
There are also methods for clustering words that do not require the words to already be available as vectors. Probably the most cited such technique is Brown clustering, proposed in 1992 by Brown et al. (not related to the Brown corpus, which is named after Brown University, Rhode Island).

Brown clustering is a hierarchical clustering method. If cut at the right levels in the tree, it also results in beautiful, flat clusters such as the following:
adapted from: Brown et al. (1992)

You can also look at small sub-trees and find clusters that contain
word pairs close to synonymity such
as evaluation and assessment or conversation and discussion.

adapted from: Brown et al. (1992)

How is this achieved? Again, this method relies on the distributional hypothesis. It introduces a quality function describing how well the surrounding context words predict the occurrence of the words in the current cluster (so-called mutual information). It then follows this procedure:

1. Initialize by assigning every word to its own, unique cluster.
2. Until only one cluster (the root) is left: merge the two
clusters whose union has the best quality function value.

This is the reason why evaluation and assessment are merged so
early. Since both appear in extremely similar contexts, the quality
function from above still delivers a very good value after the merge.
The fact that we start out with single-element clusters that are
gradually unified means this method is agglomerative.
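
The following is a highly simplified, hedged sketch of this agglomerative procedure on a toy corpus. Real Brown clustering keeps only a fixed number of active clusters and uses clever incremental updates; this naive version recomputes the mutual-information quality from scratch for every candidate merge:

import math
from collections import Counter
from itertools import combinations

# toy corpus; in practice Brown clustering is run on millions of tokens
corpus = "the box is on the shelf the box is under the shelf".split()
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(bigrams.values())

def quality(clusters):
    # average mutual information between adjacent cluster pairs
    word2c = {w: i for i, c in enumerate(clusters) for w in c}
    pair, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2c[w1], word2c[w2]
        pair[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items())

# 1. every word starts out in its own, unique cluster
clusters = [{w} for w in set(corpus)]

# 2. greedily merge the pair whose union keeps the quality function highest
while len(clusters) > 1:
    best_q, best_i, best_j = max(
        (quality([c for k, c in enumerate(clusters) if k not in (i, j)]
                 + [clusters[i] | clusters[j]]), i, j)
        for i, j in combinations(range(len(clusters)), 2))
    clusters = ([c for k, c in enumerate(clusters) if k not in (best_i, best_j)]
                + [clusters[best_i] | clusters[best_j]])
    print(clusters)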

Brown clustering is still used today! In this publication by Owoputi et
al. (2013), Brown clustering was used to find new clusters of words in
online conversational language deserving their own part-of-speech tag.
The results are entertaining, yet accurate:

Source: Owoputi et al. (2013)


If you are interested in learning more about how brown clustering
works, I can strongly recommend watching this lecture given by
Michael Collins at Columbia University.

While this paragraph concludes our section about clustering words,
there are many more approaches out there not discussed in this
article. One very promising and efficient way of clustering words is
graph-based clustering, also called spectral clustering. Methods
used include minimal spanning tree based clustering, Markov chain
clustering and Chinese whispers.

Clustering documents
In general, clustering documents can also be done by looking at each
document in vector format. But documents rarely have contexts the way
words do. You could imagine a book standing next to other books on a tidy
shelf, but usually this is not what large collections of digital documents
(so-called corpora) look like.

The fastest (and arguably most trivial) way to vectorize a document is
to give each word in the dictionary its own vector dimension and then
just count the occurrences of each word in each document. This
way of looking at documents without considering the word order is
called the bag of words approach. The Oxford English Dictionary
contains over 300,000 main entries, not counting homographs. That’s
a lot of dimensions, and most of them will probably get the value zero
(or how often do you read the
words lackadaisical, peristeronic and amatorculist?).

Up to a certain extent you can counteract this by removing all word
dimensions that are not used in your document collection (corpus), but
you will still end up with lots of dimensions. And if you suddenly see a
new document containing a word previously not used, you will have to
update every single document vector, adding this new dimension and
the value zero for it. So, for the sake of simplicity, let’s assume our
document collection does not grow.

tf-idf
Look at the following toy example containing only two short
documents d1 and d2 and the resulting bag of words vectors:

What you can see is that words that are not very specific,
like I and love, get rewarded with the same value as the words actually
discerning the two documents, like pizza and chocolates. A way to
counteract this behavior is to use tf-idf, a numerical statistic used as a
weighting factor that dampens the effect of less important words.

Tf-idf stands for term frequency and inverse document frequency, the
two factors used for weighting. The term frequency is simply the
number of occurrences of a word in a specific document. If our
document is “I love chocolates and chocolates love me”, the term
frequency of the word love would be two. This value is often
normalized by dividing it by the highest term frequency in the given
document, resulting in term frequency values between 0 (for words
not appearing in the document) and 1 (for the most frequent word in
the document). The term frequencies are calculated per word and
document.

The inverse document frequency, on the other hand, is only
calculated per word. It indicates in how many documents of the
entire corpus a word appears. This fraction is inverted and dampened
by taking its logarithm. Remember the ubiquitous word I we wanted
to get rid of? Since the logarithm of one is zero, its influence is
completely eliminated.

Based on these formulas, we get the following values for our toy
example:
Looking at the last two columns, we see that only the most relevant
words receive a high tf-idf value. So-called stop words, meaning
words that are ubiquitous in our document collection, get a value of 0
or close to 0.
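
A small, hedged sketch of this computation in plain Python, assuming the two toy documents are “I love pizza” and “I love chocolates”:

import math
from collections import Counter

docs = {"d1": "i love pizza".split(),
        "d2": "i love chocolates".split()}

def tf(word, doc):
    # raw count, normalized by the count of the most frequent word in the document
    counts = Counter(doc)
    return counts[word] / max(counts.values())

def idf(word):
    containing = sum(1 for doc in docs.values() if word in doc)
    return math.log(len(docs) / containing)

vocab = sorted({w for doc in docs.values() for w in doc})
for name, doc in docs.items():
    print(name, {w: round(tf(w, doc) * idf(w), 3) for w in vocab})
# "i" and "love" end up with 0, while "pizza" and "chocolates" keep a positive weight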

The resulting tf-idf vectors are still as high-dimensional as the original
bag of words vectors. Therefore, dimensionality reduction techniques
such as latent semantic indexing (LSI) are often used to make them
easier to handle. Algorithms such as k-means, DBSCAN and EM can be
used on document vectors, too, just as described earlier for word
clustering. Possible distance measures include Euclidean and cosine
distance.

Latent Dirichlet allocation (LDA)


Often, just having clusters of documents is not enough.

Topic modeling algorithms are statistical methods that analyze
the words of the original texts to discover the themes that run
through them, how those themes are connected to each other,
and how they change over time (Blei, 2012).

All topic models are based on the same basic assumption:

 each document consists of a distribution over topics, and


 each topic consists of a distribution over words.

Among topic models such as probabilistic latent semantic
analysis (pLSA), latent Dirichlet allocation (LDA) is the best known
and most widely used one. Just by looking at its name, we can already
find out a lot about how it works.

Latent refers to hidden variables, a Dirichlet distribution is a
probability distribution over other probability distributions,
and allocation means that some values are allocated based on the
two. To better understand how these three aspects come into play, let’s
look at what results LDA gives us. The following topics are an excerpt
of 100 topics uncovered by fitting an LDA model to 17,000 articles from
the journal Science.

adapted from: Blei (2012)


How should you interpret these topics? A topic is a probability
distribution over words. Some words are more likely to appear in a
topic, some less. What you see above is the 10 most frequent words
per topic, excluding stop words. It is important to note that the topics
don’t actually have the names Genetics or Evolution. These are just
terms we humans would use to summarize what the topic is about. Try
to look at the topics as word probability distribution 1 or word
probability distribution 23 instead.

But this is not all that LDA provides us with. Additionally, it tells us for
each document which topics appear in it and to which percentage. For
example, an article about a new device which can detect a specific
disease prevalence in your DNA may consist of the topic mix
48% Disease, 31% Genetics and 21% Computers.

To understand how we can make a computer know what good topics
look like and how to find them, we will again construct a little toy
example. Let’s pretend our corpus vocabulary consists only of a couple
of emojis.

Possible topics or word probability distributions on this vocabulary
might look like this:
As humans, we might identify these topics as Foods,
Smileys and Animals. Let’s assume the topics are given. To understand
the underlying assumptions LDA makes, let’s look at the generative
process of documents. Even if they don’t seem very realistic, an author
is assumed to take the following steps:

1. Choose how many words you want to write in your text.
2. Select a mixture of topics your text should cover.
3. For each word in the document:
 draw a topic which the word should relate to from the mixture
 draw a word from the selected topic’s word distribution

Based on these steps, the following document may be the result:
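
A minimal sketch of this generative story in code, with an assumed plain-word toy vocabulary standing in for the emoji topics above:

import numpy as np

rng = np.random.default_rng(0)

# assumed toy vocabulary and three hand-built topics (word distributions)
vocab = ["pizza", "burger", "salad", "smile", "laugh", "wink", "dog", "cat", "fox"]
topics = np.array([
    [.30, .30, .30, .02, .02, .02, .01, .02, .01],   # "Foods"
    [.02, .02, .02, .30, .30, .30, .01, .02, .01],   # "Smileys"
    [.01, .02, .01, .02, .02, .02, .30, .30, .30],   # "Animals"
])

n_words = 10                                  # 1. choose the document length
theta = rng.dirichlet(alpha=[0.5, 0.5, 0.5])  # 2. choose a mixture of topics
doc = []
for _ in range(n_words):                      # 3. for each word:
    z = rng.choice(len(topics), p=theta)      #    draw a topic from the mixture
    doc.append(rng.choice(vocab, p=topics[z]))#    draw a word from that topic
print(theta.round(2), doc)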

Note that even though a pizza emoji is much less likely to be drawn
from topic 3, it is still possible for it to originate from there. Now that
we have the result we want, we only have to find a way to reverse this
process. Only? In reality, we are faced with this:
In theory, we could make our computer try out every possible
combination of words and topics. Besides the fact that this would
probably take an eternity, how do we know in the end which
combination makes sense and which one doesn’t? For this, the
Dirichlet distribution comes in handy. Instead of drawing the
distribution like in the image above, let’s draw our document on a
topic simplex in the respective position.

We can also draw a lot of other documents from our corpus next to
this one. It could look like this:
Or, if we had chosen other topics beforehand, like this:
While in the first variant the documents are clearly discernible and
emphasize different topics, the documents in the second variant are all
more or less alike. The topics chosen here were not able to separate
the documents in a meaningful way. These two possible document
distributions are nothing else than two different Dirichlet
distributions! This means we have found a way of describing what
“good” distributions over topics look like!

The same principle applies to words in topics. Good topics will have
different distributions over words, while bad topics will have about the
same words as others. These two appearances of Dirichlet
distributions are described in the LDA model by two hyperparameters,
alpha and beta. As a general rule, you want to keep your Dirichlet
parameters below one. Take a look at how different values change the
distribution in this brilliant animation made by David Lettier.

Source: David Lettier


Good word and topic distributions are usually approximated by using
techniques such as collapsed Gibbs sampling or expectation
propagation. Both methods iteratively improve randomly initialized
word and topic distributions. However, a “perfect” allocation may
never be found.

If you want to try out what LDA does to a data set of your choice
interactively, click your way through this great in-browser demo by
David Mimno.

Remember the clustering flowchart presented in the introduction?
Especially the last step, called final evaluation? When using clustering
methods you should always keep in mind that even though a specific
model may result in the least vector mean movement or the lowest
probability distribution value, this doesn’t mean that it is “correct”.
There are many data sets k-means cannot cluster properly, and even
LDA can produce topics that don’t make any sense to humans.

An example constituency parse tree representation. The subgraph represents a subtree feature (grammar production feature).

Feature transformation is the process of converting raw data, which
can be text, images, graphs, time series etc., into numerical features
(vectors), so that we can perform algebraic operations on it.

Text data usually consists of documents which can represent words,
sentences or even paragraphs of free-flowing text. The inherent
unstructured (no neatly formatted data columns!) and noisy nature of
textual data makes it harder for machine learning methods to directly
work on raw text data. Hence, in this article, we will explore some of
the most popular and effective strategies for transforming text data
into feature vectors. These features can then be used in building
machine learning or deep learning models easily.

Feature Transformation Techniques


1. Bag of Words (BOW)

2. Term Frequency and Inverse document Frequency (TFIDF)

3. Word Embedding using Embedding Layer

4. Word to Vectors

An important point to note here is that before performing any of the
above-mentioned feature transformation techniques, it is mandatory to
perform text preprocessing to standardize the data, remove noise and
reduce dimensionality. Here is my blog on text preprocessing.

Bag of Words (BOW)


This is one of the simplest vector space representation models for
unstructured text. A vector space model is simply a mathematical
model to represent unstructured text (or any other data) as numeric
vectors, such that each dimension of the vector is a specific
feature/attribute. The bag of words model represents each text document
as a numeric vector where each dimension is a specific word from the
corpus and the value could be its frequency in the document, its
occurrence (denoted by 1 or 0) or even a weighted value. The model’s
name comes from the fact that each document is represented literally as
a ‘bag’ of its own words, disregarding word order, sequence and grammar.

Let’s say our corpus has 3 documents, as follows:


I. This movie is very scary and long
II. This movie is not scary and is slow
III. This movie is spooky and good

We need to design the vocabulary, i.e. the list of all unique
words (ignoring case and punctuation). So we end up with the following
words: ‘this’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’,
‘spooky’, ‘good’. Each of these words represents a dimension in the
vector space.

Because we know the vocabulary has 11 words, we can use a fixed-length
vector of 11 to represent each document, with one position in the
vector to score each word. The simplest scoring method is to mark the
presence of words as a boolean value: 0 for absent, 1 for present. For
example, “This movie is very scary and long” can be represented as
[1 1 1 1 1 1 1 0 0 0 0]. In the same way, we can represent all the documents
and arrive at the document term matrix below.

Document term matrix

Some additional simple scoring methods include:

 Counts: count the number of times each word appears in a document.
 Frequencies: calculate the frequency with which each word appears in a
document out of all the words in the document.

Code snippet
The sklearn library will do all the computation for us in just a single line of
code. Using sklearn we can control the vocabulary size, remove stopwords,
choose the scoring method (binary or count), and, when building the
vocabulary, ignore terms that have a document frequency higher or lower
than a given threshold. Have a look at sklearn’s detailed documentation here.
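
Since the original snippet is not reproduced here, the following is a rough sketch of how the document term matrix above could be built with sklearn’s CountVectorizer (binary scoring assumed):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This movie is very scary and long",
          "This movie is not scary and is slow",
          "This movie is spooky and good"]

# binary=True marks presence/absence; drop it to get raw counts instead
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())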

Drawbacks of using a Bag-of-Words (BoW) Model

In the above example, we have vectors of length 11. However, we
start facing issues when we come across new sentences:
1. If the new sentences contain new words, then our vocabulary size
would increase and thereby the length of the vectors would increase too.
2. Additionally, the vectors would also contain many 0s, thereby
resulting in a sparse matrix (which is what we would like to avoid).
3. We retain no information on the grammar of the sentences
nor on the ordering of the words in the text.

2. Term Frequency and Inverse document Frequency (TFIDF)


There are some potential problems which might arise with the Bag of
Words model when it is used on large corpora. Since the feature
vectors are based on absolute term frequencies, there might be some
terms which occur frequently across all documents and these may
tend to overshadow other terms in the feature set. The TF-IDF model
tries to combat this issue by using a scaling or normalizing factor in its
computation. TF-IDF stands for Term Frequency-Inverse Document
Frequency, which uses a combination of two metrics in its
computation, namely: term frequency (tf) and inverse document
frequency (idf).
 Term Frequency: It is a measure of how frequently a
term ‘t’ appears in a document ‘d’.

 Inverse Document Frequency: It is a measure of how important a term is.
We need the IDF value because computing just the TF alone is not
sufficient to understand the importance of words.

The scores are a weighting where not all words are equally
important or interesting. The scores have the effect of highlighting
words that are distinct (contain useful information) in a given
document. Rare words like ‘supine’ or ‘idyllic’ get scored highly
compared to common words like ‘the’, ‘is’ and ‘and’.

Below we will see how to compute tf-idf for document “This movie is
not scary and is slow”

 Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’
 Number of words in Document 2 = 8
 TF for the word ‘this’ = (number of times ‘this’ appears in
review 2)/(number of terms in review 2) = 1/8
 IDF(‘this’) = log(number of documents/number of
documents containing the word ‘this’) = log(3/3) = log(1)
=0
 TF-IDF = TF x IDF
TF-IDF(‘this’, Doc 2) = TF(‘this’, Doc 2) * IDF(‘this’) = 1/8
*0=0

Similarly we compute tf-idf values for all the words in document and
the final tf-idf vector looks as below for document 2

Tf-idf Vector of document 2 is highlighted in red border

Code snippet
Like CountVectorizer, TfidfVectorizer can handle text preprocessing
steps such as stopword removal, lowercasing, vocabulary size and so on.
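
As a hedged stand-in for the original snippet, a minimal TfidfVectorizer example on the same three movie reviews might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This movie is very scary and long",
          "This movie is not scary and is slow",
          "This movie is spooky and good"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))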
You may notice that the tf-idf vector we derived for Doc 2 above differs
from the tf-idf vector produced by the sklearn function.

This is because the TF-IDF computed in scikit-learn differs slightly from
the standard textbook notation.

Different Notation
TF remains the same, while IDF is different: a constant 1 is added to the
numerator and denominator of the IDF, as if an extra document was seen
containing every term in the collection exactly once, which prevents zero
divisions. With the default smooth_idf=True, scikit-learn computes
idf(t) = ln((1 + n) / (1 + df(t))) + 1, whereas the standard notation found
in most textbooks does not include the constant 1. For a more detailed
explanation and a step-by-step derivation, have a look at this blog.

3. Word Embedding using Embedding Layer

A word embedding is a class of approaches for representing words and
documents using a dense vector representation. It is an improvement
over the traditional bag-of-words model encoding schemes where large
sparse vectors were used to represent each word or to score each
word within a vector to represent an entire vocabulary. These
representations were sparse because the vocabularies were vast and a
given word or document would be represented by a large vector
comprised mostly of zero values.

Instead, in an embedding, words are represented by dense vectors
where a vector represents the projection of the word into a continuous
vector space, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]. A word
embedding can be learned as part of a deep learning model. This can be
a slower approach, but it tailors the model to a specific training dataset.

Keras offers an Embedding layer that can be used for neural networks
on text data. It requires that the input data be integer encoded, so that
each word is represented by a unique integer. The Embedding layer is
initialized with random weights and will learn an embedding for all of
the words in the training dataset.

In the code snippet we pass input of shape (2, 4) (think of it as 2 text
documents, each integer-encoded to a length of 4). This input is
passed to an embedding layer which creates a vector of shape (1, 3)
for each index, resulting in an output shape of (2, 4, 3).
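
A minimal sketch of such a snippet, assuming a TensorFlow 2.x style Keras API (argument names such as input_length differ between Keras versions) and an arbitrary vocabulary size of 50:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# two "documents", each integer-encoded to a length of 4
docs = np.array([[4, 20, 7, 1],
                 [8, 15, 3, 0]])

model = Sequential()
model.add(Embedding(input_dim=50, output_dim=3, input_length=4))
model.compile('rmsprop', 'mse')

# every one of the 2 x 4 indices is mapped to a 3-dimensional vector -> (2, 4, 3)
print(model.predict(docs).shape)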

You can copy-paste the code snippet into your Python notebook, modify
the input shape and the Embedding layer’s input_length argument, and
play with it for a better understanding. For a lot more examples,
click here.

It is a flexible layer that can be used in a variety of ways, such as:

 It can be used alone to learn a word embedding that can
be saved and used in another model later.
 It can be used as part of a deep learning model where the
embedding is learned along with the model itself.
 It can be used to load a pre-trained word embedding
model, a type of transfer learning.

4. Word to Vectors

Traditional approaches to NLP, such as bag-of-words models and
TF-IDF models, do not capture information about a word’s meaning or
context. This means that potential relationships such as contextual
closeness are not captured across collections of words. For example, a
bag-of-words model cannot capture simple relationships such as determining
that the words “dog” and “cat” both refer to animals that are often
discussed in the context of household pets. Such encodings often
provide sufficient baselines for simple NLP tasks like email spam
classifiers, but lack the sophistication for more complex tasks such as
translation and speech recognition. Traditional approaches to NLP
such as bag-of-words do not capture syntactic (structure) and
semantic (meaning) relationships across collections of words and
therefore represent language in a very naïve way.

In contrast, word vectors represent words as multidimensional
continuous floating point numbers where semantically similar words
are mapped to proximate points in geometric space. In simpler
terms, a word vector is a row of real-valued numbers where each
point captures a dimension of the word’s meaning and where
semantically similar words have similar vectors. This means that
words such as wheel and engine should have similar word vectors to
the word car because of the similarity of their meanings, whereas the
word banana should be quite distant.

The beauty of representing words as vectors is that they lend
themselves to mathematical operations. Word vectors capture
male-female relationships, verb tense relationships, country-capital
relationships and so on.
Word2Vec is a method to construct such an embedding. It can be
obtained using two methods, both involving neural networks:
Skip-Gram and Continuous Bag of Words (CBOW). For a deep dive
into how to train Word2Vec, have a look at this blog.

Instead of training a model from scratch, we can use Google’s
pre-trained model. The pre-trained Google word2vec model was trained on
Google News data (about 100 billion words); it contains 3 million
words and phrases and was fit using 300-dimensional word vectors.
The code snippet shows that when we subtract the vector for man from
the vector for king and add the vector for woman, the closest word vector
we get is queen:
king - man + woman = queen. The “man-ness” in king is replaced
with “woman-ness” to give us queen.
In data mining and statistics, hierarchical clustering analysis is a method of
cluster analysis that seeks to build a hierarchy of clusters, i.e. a tree-type
structure based on the hierarchy.
In machine learning, clustering is an unsupervised learning technique that
groups data based on the similarity between data points. There are different
types of clustering algorithms in machine learning:
 Connectivity-based clustering: This type of clustering algorithm builds the
cluster based on the connectivity between the data points. Example:
Hierarchical clustering
 Centroid-based clustering: This type of clustering algorithm forms
around the centroids of the data points. Example: K-Means clustering, K-
Mode clustering
 Distribution-based clustering: This type of clustering algorithm is
modeled using statistical distributions. It assumes that the data points in
a cluster are generated from a particular probability distribution, and the
algorithm aims to estimate the parameters of the distribution to group
similar data points into clusters Example: Gaussian Mixture Models
(GMM)
 Density-based clustering: This type of clustering algorithm groups
together data points that are in high-density concentrations and
separates points in low-concentrations regions. The basic idea is that it
identifies regions in the data space that have a high density of data
points and groups those points together into clusters.
Example: DBSCAN(Density-Based Spatial Clustering of Applications with
Noise)
In this article, we will discuss connectivity-based clustering algorithms i.e
Hierarchical clustering

Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering model that groups the
data points together that are close to each other based on the measure of
similarity or distance. The assumption is that data points that are close to each
other are more similar or related than data points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the
hierarchical relationships between groups. Individual data points are located at
the bottom of the dendrogram, while the largest clusters, which include all the
data points, are located at the top. In order to generate different numbers of
clusters, the dendrogram can be sliced at various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a
measure of similarity or distance between data points. Clusters are divided or
merged repeatedly until all data points are contained within a single cluster, or
until the predetermined number of clusters is attained.
We can look at the dendrogram and measure the height at which the branches
of the dendrogram form distinct clusters to calculate the ideal number of
clusters. The dendrogram can be sliced at this height to determine the number
of clusters.
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC). It produces a structure that is more informative than the
unstructured set of clusters returned by flat clustering. This clustering
algorithm does not require us to prespecify the number of clusters. Bottom-up
algorithms treat each data point as a singleton cluster at the outset and then
successively agglomerate pairs of clusters until all clusters have been merged
into a single cluster that contains all the data.
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about the primary
    # diagonal, we compute only its lower part
    for j = 1 to i:
        dist_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains
Hierarchical Agglomerative Clustering

Steps:
 Consider each alphabet as a single cluster and calculate the distance of
one cluster from all the other clusters.
 In the second step, comparable clusters are merged together to form a
single cluster. Let’s say cluster (B) and cluster (C) are very similar to
each other, so we merge them in the second step; similarly for clusters
(D) and (E). We end up with the clusters [(A), (BC), (DE), (F)].
 We recalculate the proximity according to the algorithm and merge the
two nearest clusters([(DE), (F)]) together to form new clusters as [(A),
(BC), (DEF)]
 Repeating the same process; The clusters DEF and BC are comparable
and merged together to form a new cluster. We’re now left with clusters
[(A), (BCDEF)].
 At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
Python implementation of the above algorithm using the scikit-learn library:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# here we need to mention the number of clusters,
# otherwise the result will be a single cluster
# containing all the data
clustering = AgglomerativeClustering(n_clusters=2).fit(X)

# print the class labels
print(clustering.labels_)

Output :
[1, 1, 1, 0, 0, 0]
Hierarchical Divisive clustering
It is also known as the top-down approach. This algorithm also does not require
prespecifying the number of clusters. Top-down clustering requires a method for
splitting a cluster that contains the whole data and proceeds by splitting clusters
recursively until individual data points have been split into singleton clusters.
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
Hierarchical Divisive clustering

Computing the Distance Matrix

While merging two clusters, we check the distance between every pair of
clusters and merge the pair with the least distance/most similarity. But the
question is: how is that distance determined? There are different ways of
defining inter-cluster distance/similarity. Some of them are:
1. Min Distance: Find the minimum distance between any two points of the
cluster.
2. Max Distance: Find the maximum distance between any two points of
the cluster.
3. Group Average: Find the average distance between every two points of
the clusters.
4. Ward’s Method: The similarity of two clusters is based on the increase in
squared error when two clusters are merged.
For example, if we group given data using different methods, we may get
different results:
Distance Matrix Comparison in Hierarchical Clustering
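
A small sketch of this comparison with scikit-learn, which exposes the four criteria above through the linkage parameter (single = min distance, complete = max distance, average = group average, ward):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# the same data can be grouped differently depending on the linkage criterion
for linkage in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(linkage, labels)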

Implementation code

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# perform hierarchical clustering
Z = linkage(X, 'ward')

# plot the dendrogram
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()
Output:

Hierarchical Clustering Dendrogram

Hierarchical Agglomerative vs Divisive Clustering


 Divisive clustering is more complex compared to agglomerative
clustering, as in the case of divisive clustering we need a flat clustering
method as a “subroutine” to split each cluster until each data point
ends up in its own singleton cluster.
 Divisive clustering is more efficient if we do not generate a complete
hierarchy all the way down to individual data leaves. The time
complexity of naive agglomerative clustering is O(n³) because we
exhaustively scan the N x N matrix dist_mat for the lowest distance in
each of N-1 iterations. Using a priority queue data structure we can reduce
this complexity to O(n² log n). By using some more optimizations it can
be brought down to O(n²). For divisive clustering, on the other hand, given
a fixed number of top levels and using an efficient flat algorithm like
K-Means, divisive algorithms are linear in the number of patterns and clusters.
 A divisive algorithm is also more accurate. Agglomerative clustering
makes decisions by considering the local patterns or neighboring points
without initially taking into account the global distribution of the data.
These early decisions cannot be undone, whereas divisive clustering takes
into consideration the global distribution of the data when making
top-level partitioning decisions.
Topic modelling in natural language processing is a technique which
assigns topics to a given corpus based on the words present. Topic
modelling is important because, in this world full of data, it has become
increasingly important to categorize documents. For example, if a
company receives hundreds of reviews, it is important for the
company to know which categories of reviews are more important and
vice versa.

In this article, we will see the following:

1. LDA
2. Hyperparameters in LDA
3. LDA in Python
4. Shortcomings of LDA
5. Alternative

Topics can be thought of as keywords which can describe a document;
for example, for the topic sports the words that come to our mind are
volleyball, basketball, tennis, cricket etc. A topic model is a model
which can automatically detect topics based on the words appearing in
a document.

It is important to note that topic modelling is different from topic
classification. Topic classification is a supervised learning task, while
topic modelling is an unsupervised learning algorithm.

Some of the well known topic modelling techniques are

1. Latent Semantic Analysis (LSA)


2. Probabilistic Latent Semantic Analysis (PLSA)
3. Latent Dirichlet Allocation (LDA)
4. Correlated Topic Model (CTM)

In this article, we will focus on LDA

Topic Modelling. image from pyGotham

Latent Dirichlet Allocation


LDA, short for Latent Dirichlet Allocation, is a technique used for topic
modelling. First, let us break down the name and understand what
LDA means. Latent means hidden, something that is yet to be
found. Dirichlet indicates that the model assumes that the topics in the
documents and the words in those topics follow a Dirichlet
distribution. Allocation means giving something, which in this case
are topics.
LDA. Image by Kim et al.

LDA assumes that documents are generated by a statistical
generative process, such that each document is a mixture of topics,
and each topic is a mixture of words.

In the following figure, the document is made up of 10 words, which can
be grouped into 3 different topics, and the three topics have their own
describing words.
Document Generation Assumption. Image from my great learning.

The general steps in the LDA are as follows

Image from my great learning


Hyperparameters in LDA
There are three hyperparameters in LDA

1. α → document density factor


2. β → topic word density factor
3. K → number of topics selected

The α hyperparameter controls the number of topics expected in the
document. The β hyperparameter controls the distribution of words
per topic in the document, and K defines how many topics we need to
extract.
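
In scikit-learn’s LatentDirichletAllocation these three hyperparameters map to doc_topic_prior (α), topic_word_prior (β) and n_components (K); the values below are purely illustrative:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=12,       # K: number of topics to extract
    doc_topic_prior=0.5,   # α: document-topic density
    topic_word_prior=0.1,  # β: topic-word density
    random_state=42,
)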

LDA in Python
Let us look at an implementation of LDA. We will try to extract topics
from a set of reviews.

The dataset that we will be working with is a set of reviews, which looks
as follows:

dataset

Feature Extraction:

This step is not related to LDA, so please feel free to skip ahead to
vectorization. First, we will do feature extraction to get some meaningful
insights into the data.

We have extracted the following features

1. Number of words in a document


2. Number of characters in a document
3. Average word length of the document
4. Number of stop-words present
5. Number of numeric characters
6. Number of upper count characters
7. The polarity sentiment

Data cleaning and Preprocessing:

In data cleaning and preprocessing, we have done the following

1. Made all the characters to lower case


2. Expanded the short forms, like I’ll → I will
3. Removed special characters
4. Removed extra and trailing spaces
5. Removed accented characters and replaced them with
their alternative
6. Lemmatized the words
7. Removed stop words

Vectorization:

Since LDA works on raw term counts rather than weighted values, we will
use a CountVectorizer instead of a TF-IDF vectorizer.
Latent Dirichlet Allocation:

In this example, we were given the number of topics, so we did not
have to tune the hyperparameter K. But when we do not know the
number of topics, we can use grid search.

This can be done with a grid search over the number of topics.
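
A minimal sketch with scikit-learn (the stand-in reviews and the candidate topic counts are assumptions, not taken from the original notebook):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# assumed stand-in for the cleaned review texts used in the article
reviews = ["good movie loved the acting",
           "terrible plot and a waste of time",
           "great soundtrack and beautiful scenery",
           "boring story but decent acting",
           "the delivery was late and the product arrived broken",
           "excellent service and friendly staff"]

X = CountVectorizer().fit_transform(reviews)

# search over the number of topics; scoring uses LDA's approximate log-likelihood
search = GridSearchCV(LatentDirichletAllocation(random_state=42),
                      {"n_components": [2, 3, 4]}, cv=3)
search.fit(X)
print(search.best_params_)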

The motivation for our model is as follows:

 Since we know the number of topics, we will be using
Latent Dirichlet Allocation with the number of topics set to 12.
 We will also not need to compare different models to get
the best number of topics.
 We will use random_state so that the results can be
reproduced.
 We will fit the model on the vectorized data and transform
that same data with it.
 After fitting the model, we will print the top 10 words of
each topic.
 After getting the topics, we will create a new column and
assign each review its dominant topic (a sketch of these steps
follows this list).
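
A hedged sketch of these steps with scikit-learn (the stand-in reviews are made up, and the toy example uses 3 topics instead of the article’s 12 to stay readable):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# assumed stand-in for the cleaned reviews used in the article
df = pd.DataFrame({"review": ["good movie loved the acting",
                              "terrible plot and a waste of time",
                              "great soundtrack and beautiful scenery"]})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["review"])

lda = LatentDirichletAllocation(n_components=3, random_state=42)
doc_topics = lda.fit_transform(X)            # document-topic distribution

# print the top words of each topic
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {k}: {top}")

# assign each review its dominant topic in a new column
df["topic"] = np.argmax(doc_topics, axis=1)
print(df)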

Topic Assignments:

To assign the topics, we can do the following:

1. See the word-clouds of each topic


2. See the top 10 words
3. Look for KERA → Keyword Extraction for Reports and
Articles

To make word clouds, we can simply import the WordCloud library.

To know more about KERA, refer to the paper “Exploratory Analysis of
Highly Heterogeneous Document Collections” by Maiya et al., available
at this link on arXiv.

The abstract is as follows

We present an effective multifaceted system for exploratory


analysis of highly heterogeneous document collections. Our
system is based on intelligently tagging individual documents in a
purely automated fashion and exploiting these tags in a powerful
faceted browsing framework. Tagging strategies employed
include both unsupervised and supervised approaches based on
machine learning and natural language processing. As one of our
key tagging strategies, we introduce the KERA algorithm
(Keyword Extraction for Reports and Articles). KERA extracts
topic-representative terms from individual documents in a purely
unsupervised fashion and is revealed to be significantly more
effective than state-of-the-art methods. Finally, we evaluate our
system in its ability to help users locate documents pertaining to
military critical technologies buried deep in a large
heterogeneous sea of information.

Problems in the model:

 We had to map the discovered topics to the provided topic labels
manually, which can cause errors.
 We could not check whether the assigned topics are correct or not.
 Only one topic is assigned per document, while ideally the assignment
should depend on what matches best.
 In some documents, all the topics have the same probability, which
causes problems, as we are selecting only the maximum.
 Some of the words had no relation to the topic, such
as discount and change in date.

Shortcomings of LDA:
1. LDA performs poorly on short texts, and most of our data was short.
2. Since the reviews are not coherent, LDA finds it all the more
difficult to identify the topics.
3. Since the reviews are mainly context-based, models based on word
co-occurrences fail.
