Data Mining
Data mining from text is the process of extracting information from unstructured textual sources so that
entities can be found, classified and stored in a database. Semantically enhanced
information extraction (also known as semantic annotation) couples those entities with their
semantic descriptions and connections from a knowledge graph. By adding metadata to the
extracted concepts, this technology solves many challenges in enterprise content
management and knowledge discovery.
Information extraction is the process of extracting specific (pre-specified) information from
textual sources. One of the simplest examples is when your email client extracts just the
event details from a message so that you can add them to your calendar.
Other free-flowing textual sources from which information extraction can distill structured
information are legal acts, medical records, social media interactions and streams, online
news, government documents, corporate reports and more.
Gathering detailed structured data from texts, information extraction enables:
The automation of tasks such as smart content classification, integrated search, management
and delivery;
Data-driven activities such as mining for patterns and trends, uncovering hidden relationships,
etc.
Pre-processing of the text – this is where the text is prepared for processing with
the help of computational linguistics tools such as tokenization, sentence splitting,
morphological analysis (morphemes can be inflectional or derivational), etc.
Finding and classifying concepts – this is where mentions of people, things,
locations, events and other pre-specified types of concepts are detected and
classified.
Connecting the concepts – this is the task of identifying relationships between the
extracted concepts.
Unifying – this subtask is about presenting the extracted data in a standard form.
Getting rid of the noise – this subtask involves eliminating duplicate data.
Enriching your knowledge base – this is where the extracted knowledge is ingested
in your database for further use.
Information extraction can be entirely automated or performed with the help of human
input.
Marc Marquez was fastest in the final MotoGP warm-up session of the 2016 season at
Valencia, heading Maverick Vinales by just over a tenth of a second.
After qualifying second on Saturday behind a rampant Jorge Lorenzo, Marquez took
charge of the 20-minute session from the start, eventually setting a best time of
1m31.095s at half-distance.
Through information extraction, the following basic facts can be pulled out of the free-
flowing text and organized in a structured, machine-readable form:
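For instance, running a pretrained named entity recognizer over the first sentence already surfaces most of these facts. A minimal sketch with spaCy (assuming the en_core_web_sm model is installed; the exact labels depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

sentence = ("Marc Marquez was fastest in the final MotoGP warm-up session of the "
            "2016 season at Valencia, heading Maverick Vinales by just over a "
            "tenth of a second.")

for ent in nlp(sentence).ents:
    print(ent.text, ent.label_)
# Model-dependent, but typically: Marc Marquez (PERSON), 2016 (DATE),
# Valencia (GPE), Maverick Vinales (PERSON), etc.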
Semantic annotation is applicable for any sort of text – web pages, regular (non-web)
documents, text fields in databases, etc. Further knowledge acquisition can be
performed on the basis of extracting more complex dependencies – analysis of
relationships between entities, event and situation descriptions, etc.
Named entity recognition (NER) is one of the most popular data
preprocessing tasks. It involves identifying key information in the text
and classifying it into a set of predefined categories. An entity is basically the
thing that is consistently talked about or referred to in the text.
NER is a form of NLP.
At its core, NER is a two-step process; below are the two steps that are
involved:
Detecting the entities in the text
Classifying them into different categories
Some of the most important entity categories in NER are:
Person
Organization
Place/location
Other common tasks include classifying the following:
Date/time expressions
Numeral measurements (money, percent, weight, etc.)
E-mail address
Ambiguity in NE
For a person, the category definition is intuitively quite clear, but for
computers, there is some ambiguity in classification. Let's look at some
ambiguous examples:
England (Organisation) won the 2019 world cup vs. the 2019
world cup happened in England (Location).
Washington (Location) is the capital of the US vs. the first
president of the US was Washington (Person).
There are several approaches to relation extraction (RE):
1. Rule-based RE
2. Weakly Supervised RE
3. Supervised RE
4. Distantly Supervised RE
5. Unsupervised RE
We will go through all of them at a high level and discuss some pros
and cons for each one.
Rule-based RE
Many instances of relations can be identified through hand-crafted
patterns, looking for triples (X, α, Y), where X and Y are entities and α is the
sequence of words in between. For the "Paris is in France" example, α = "is in".
This could be extracted with a regular expression, for instance as sketched below.
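A minimal sketch with toy sentences (not the original article's code):

import re

sentences = ["Paris is in France.", "Berlin is in Germany.", "I am in trouble."]

# Hand-crafted pattern for the triple (X, "is in", Y)
pattern = re.compile(r"^([A-Z][a-z]+) is in ([A-Z][a-z]+)\.$")

for sent in sentences:
    m = pattern.match(sent)
    if m:
        print((m.group(1), "is in", m.group(2)))
# ('Paris', 'is in', 'France') and ('Berlin', 'is in', 'Germany');
# the third sentence does not match the pattern.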
The following are the Penn Treebank part-of-speech tags used by NLTK (referenced below):
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: "there is" ... think of it like "there
exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker 1)
MD modal could, will
NN noun, singular 'desk'
NNS noun plural 'desks'
NNP proper noun, singular 'Harrison'
NNPS proper noun, plural 'Americans'
PDT predeterminer 'all the kids'
POS possessive ending parent's
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go 'to' the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
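A quick demo of these tags with NLTK (assuming the tokenizer and tagger resources have been downloaded):

import nltk

# Assumes: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("Paris is in France")
print(nltk.pos_tag(tokens))
# Typically: [('Paris', 'NNP'), ('is', 'VBZ'), ('in', 'IN'), ('France', 'NNP')]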
Only looking at keyword matches will also retrieve many false positives.
We can mitigate this by filtering on named entities, only retrieving
(CITY, is in, COUNTRY). We can also take into account the part-of-
speech (POS) tags to remove additional false positives.
We can also transform the sentences before applying the rule. E.g.
“The cake was baked by Harry” or “The cake which Harry baked” can
be transformed into “Harry baked the cake”. Then we are changing
the order to work with our “linear rule”, while also removing
redundant modifying words in between.
Pros
Cons
Weakly Supervised RE
The idea here is to start out with a set of hand-crafted rules and
automatically find new ones from the unlabeled text data, through an
iterative process (bootstrapping). Alternatively, one can start out with
a set of seed tuples describing entities with a specific relation. E.g.
seed={(ORG:IBM, LOC:Armonk), (ORG:Microsoft, LOC:Redmond)}
states entities having the relation "based in".
Agichtein, Eugene, and Luis Gravano. “Snowball: Extracting relations from large plain-text collections.” Proceedings of
the fifth ACM conference on Digital libraries. ACM, 2000.
1. Start with a set of seed tuples (or extract a seed set from
the unlabeled text with a few hand-crafted rules).
2. Extract occurrences from the unlabeled text that match
the tuples and tag them with a NER (named entity
recognizer).
3. Create patterns for these occurrences, e.g. “ORG is based
in LOC”.
4. Generate new tuples from the text, e.g. (ORG:Intel, LOC:
Santa Clara), and add to the seed set.
5. Go to step 2, or terminate and use the patterns that were
created for further extraction.
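A toy sketch of one pass through these steps, using plain string matching in place of a real NER and a made-up three-sentence corpus:

import re

corpus = [
    "IBM is based in Armonk.",
    "Microsoft is based in Redmond.",
    "Intel is based in Santa Clara.",
]
seed = {("IBM", "Armonk"), ("Microsoft", "Redmond")}

# Steps 2-3: find sentences containing a seed pair and induce the pattern between them
patterns = set()
for org, loc in seed:
    for sent in corpus:
        if org in sent and loc in sent:
            patterns.add(sent.split(org, 1)[1].split(loc, 1)[0].strip())  # "is based in"

# Step 4: apply the induced patterns to generate new tuples
tuples = set(seed)
for pat in patterns:
    for sent in corpus:
        m = re.match(rf"(.+?) {pat} (.+?)\.", sent)
        if m:
            tuples.add((m.group(1), m.group(2)))

print(patterns)  # {'is based in'}
print(tuples)    # the seed pairs plus ('Intel', 'Santa Clara')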
Pros
More relations can be discovered than for Rule-based RE
(higher recall)
Less human effort required (only requires a high-quality
seed set)
Cons
Supervised RE
A common way to do Supervised Relation Extraction is to train a
stacked binary classifier (or a regular binary classifier) to determine if
there is a specific relation between two entities. These classifiers take
features about the text as input, thus requiring the text to be
annotated by other NLP modules first. Typical features are: context
words, part-of-speech tags, dependency path between entities, NER
tags, tokens, proximity distance between words, etc.
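A small sketch of this setup with scikit-learn; the feature dictionaries and labels below are invented for illustration, and a real system would extract them from annotated text:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training instances: features produced by upstream NLP modules
train = [
    ({"e1_type": "PER", "e2_type": "ORG", "between": "founded", "dist": 1}, "FounderOf"),
    ({"e1_type": "ORG", "e2_type": "LOC", "between": "is based in", "dist": 3}, "BasedIn"),
    ({"e1_type": "PER", "e2_type": "PER", "between": "met", "dist": 1}, "nil"),
]

feats, labels = zip(*train)
vec = DictVectorizer()
X = vec.fit_transform(feats)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

test = {"e1_type": "PER", "e2_type": "ORG", "between": "founded", "dist": 2}
print(clf.predict(vec.transform([test])))  # most likely ['FounderOf'] given the feature overlap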
Pros
Cons
Pros
Cons
Unsupervised RE
Here we extract relations from text without having to label any
training data, provide a set of seed tuples or having to write rules to
capture different types of relations in the text. Instead we rely on a set
of very general constraints and heuristics. It could be argued whether this is
truly unsupervised, since we are using "rules" which are just at a more
general level, and in some cases even small sets of labeled text data are
used to design and tweak the systems. Nevertheless, these systems tend to
require less supervision in general. Open Information Extraction (Open IE)
generally refers to this paradigm.
TextRunner algorithm. Bach, Nguyen, and Sameer Badaskar. “A review of relation extraction.” Literature review for
Language and Statistics II 2 (2007).
OpenIE 5.0 and Stanford OpenIE are two open-source systems that
do this. They are more modern than TextRunner (which was just
used here to demonstrate the paradigm). We can expect a lot of
different relationship types as output from systems like these (since
we do not specify what kind of relations we are interested in).
Pros
Cons
Performance of the system depends a lot on how well
constructed the constraints and heuristics are
Relations are not as normalized as with pre-specified relation
types
Text cleaning
Sentence tokenization
Word tokenization
Word-frequency table
Summarization
Text cleaning:
# !pip install -U spacy
# !python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

stopwords = list(STOP_WORDS)
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
Word tokenization:
tokens = [token.text for token in doc]
print(tokens)

punctuation = punctuation + '\n'

word_frequencies = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
print(word_frequencies)
Sentence tokenization:
max_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency
print(word_frequencies)

sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)
Word frequency table:
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
print(sentence_scores)
Summarization:
from heapq import nlargest

select_length = int(len(sentence_tokens) * 0.3)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)
print(summary)
Input:
text = """Maria Sharapova has basically no friends as tennis players on the
WTA Tour. The Russian player has no problems in openly speaking about it and
in a recent interview she said: ‘I don’t really hide any feelings too much.I
think everyone knows this is my job here. When I’m on the courts or when I’m
on the court playing, I’m a competitor and I want to beat every single person
whether they’re in the locker room or across the net.So I’m not the one to
strike up a conversation about the weather and know that in the next few
minutes I have to go and try to win a tennis match.I’m a pretty competitive
girl. I say my hellos, but I’m not sending any players flowers as well. Uhm,
I’m not really friendly or close to many players.I have not a lot of friends
away from the courts.’ When she said she is not really close to a lot of
players, is that something strategic that she is doing? Is it different on
the men’s tour than the women’s tour? ‘No, not at all.I think just because
you’re in the same sport doesn’t mean that you have to be friends with
everyone just because you’re categorized, you’re a tennis player, so you’re
going to get along with tennis players.I think every person has different
interests. I have friends that have completely different jobs and interests,
and I’ve met them in very different parts of my life.I think everyone just
thinks because we’re tennis players we should be the greatest of friends. But
ultimately tennis is just a very small part of what we do.There are so many
other things that we're interested in, that we do.'"""
Topic words
This common technique aims to identify words that describe the topic
of the input document. An advance on Luhn's initial idea was to use
the log-likelihood ratio test to identify explanatory words known as
the "topic signature". Generally speaking, there are two ways to
compute the importance of a sentence: as a function of the number of
topic signatures it contains, or as the proportion of the topic
signatures in the sentence. While the first method gives higher scores
to longer sentences with more words, the second one measures the
density of the topic words.
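A small sketch of the two scoring variants, with a made-up topic-signature set:

# Hypothetical topic signature for a motorsport article
topic_signature = {"motogp", "session", "lap", "marquez"}

def score_by_count(tokens):
    # number of topic-signature words (favours longer sentences)
    return sum(1 for t in tokens if t.lower() in topic_signature)

def score_by_density(tokens):
    # proportion of topic-signature words (measures density)
    return score_by_count(tokens) / max(len(tokens), 1)

sentence = "Marquez topped the MotoGP session on his final lap".split()
print(score_by_count(sentence), round(score_by_density(sentence), 2))  # 4 0.44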
Frequency-driven approaches
This approach uses frequency of words as indicators of importance.
The two most common techniques in this category are: word
probability and TFIDF (Term Frequency Inverse Document
Frequency). The probability of a word w is determined as the number
of occurrences of the word, f(w), divided by the number N of all words in
the input (which can be a single document or multiple documents), i.e.
P(w) = f(w)/N. Words with the highest probability are assumed to represent the topic of
the document and are included in the summary. TFIDF, a more
sophisticated technique, assesses the importance of words and
identifies very common words (that should be omitted from
consideration) in the document(s) by giving low weights to words
appearing in most documents. TFIDF has given way to centroid-based
approaches that rank sentences by computing their salience using a
set of features. After creation of TFIDF vector representations of
documents, the documents that describe the same topic are clustered
together and centroids are computed — pseudo-documents that
consist of the words whose TFIDF scores are higher than a certain
threshold and form the cluster. Afterwards, the centroids are used to
identify sentences in each cluster that are central to the topic.
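A minimal sketch of the centroid idea with scikit-learn on a toy corpus: TF-IDF vectors are clustered with k-means, and the top-weighted terms of each centroid act as a pseudo-document describing the cluster's topic.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the race was won on the final lap",
    "the rider set the fastest lap of the race",
    "the bank raised interest rates again",
    "interest rates and inflation worry the bank",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vec.get_feature_names_out()

# Highest-weighted terms of each centroid
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(f"cluster {i}:", [terms[j] for j in top])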
Graph Methods
Influenced by PageRank algorithm, these methods represent
documents as a connected graph, where sentences form the vertices
and edges between the sentences indicate how similar the two
sentences are. The similarity of two sentences is measured with the
help of cosine similarity with TFIDF weights for words and if it is
greater than a certain threshold, these sentences are connected. This
graph representation results in two outcomes: the sub-graphs included
in the graph create topics covered in the documents, and the
important sentences are identified. Sentences that are connected to
many other sentences in a sub-graph are likely to be the center of the
graph and will be included in the summary. Since this method does not
need language-specific linguistic processing, it can be applied to
various languages [43]. At the same time, measuring only the
formal side of the sentence structure, without syntactic and
semantic information, limits the application of the method.
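A simplified TextRank-style sketch of the idea (toy sentences; a TF-IDF cosine-similarity graph scored with PageRank via networkx):

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Marquez was fastest in the final warm-up session.",
    "Marquez set his best time at half-distance of the session.",
    "Lorenzo had qualified first on Saturday.",
    "The weather in Valencia was sunny.",
]

# Nodes are sentences; edges connect pairs whose similarity exceeds a threshold
sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
G = nx.Graph()
G.add_nodes_from(range(len(sentences)))
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if sim[i, j] > 0.1:
            G.add_edge(i, j, weight=sim[i, j])

# Sentences connected to many similar sentences get high centrality
scores = nx.pagerank(G, weight="weight")
print(sentences[max(scores, key=scores.get)])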
Machine Learning
Machine learning approaches that treat summarization as a
classification problem are now widely used, applying Naive
Bayes, decision trees, support vector machines, Hidden Markov
models and Conditional Random Fields to obtain a true-to-life
summary. As it has turned out, the methods that explicitly model the
dependency between sentences (Hidden Markov
models and Conditional Random Fields) often outperform other
techniques.
Chapter 2
A typical approach to relation extraction is to treat the task as a classification
problem [38, 71, 37, 18, 19]. Specifically, any pair of entities
co-occurring in the same sentence is considered a candidate relation instance.
The goal is to assign a class label to this instance where the class
label is either one of the predefined relation types or nil for unrelated
entity pairs. Alternatively, a two-stage classification can be performed
where at the first stage whether two entities are related is determined
and at the second stage the relation type for each related entity pair is
determined.
The classification approach assumes that a training corpus exists in which
all relation mentions for each predefined relation type have been manually
annotated. These relation mentions are used as positive training
examples. Entity pairs co-occurring in the same sentence but not labeled
are used as negative training examples. Each candidate relation instance
is represented by a set of features that are carefully chosen. Standard
learning algorithms such as support vector machines and logistic regression
can then be used to train relation classifiers.
Feature engineering is a critical step for this classification approach.
Researchers have examined a wide range of lexical, syntactic and semantic
features. We summarize some of the most commonly used features
as follows:
Entity features: Oftentimes the two argument entities, including
the entity words themselves and the entity types, are correlated
with certain relation types. In the ACE data sets, for example,
entity words such as father, mother, brother and sister and the
person entity type are all strong indicators of the family relation
subtype.
Lexical contextual features: Intuitively the contexts surrounding
the two argument entities are important. The simplest way to
incorporate evidence from contexts is to use lexical features. For
example, if the word founded occurs between the two arguments,
they are more likely to have the FounderOf relation.
Syntactic contextual features: Syntactic relations between the
two arguments or between an argument and another word can often
be useful. For example, if the first argument is the subject of the
verb founded and the second argument is the object of the verb
founded, then one can almost immediately tell that the FounderOf
relation exists between the two arguments. Syntactic features can
be derived from parse trees of the sentence containing the relation
instance.
Background knowledge: Chan and Roth studied the use of
background knowledge for relation extraction [18]. An example is
to make use of Wikipedia. If two arguments co-occur in the same
Wikipedia article, the content of the article can be used to check
whether the two entities are related. Another example is word
clusters. For example, if we can group all names of companies such
as IBM and Apple into the same word cluster, we achieve a level
of abstraction higher than words and lower than the general entity
type organization. This level of abstraction may help extraction
of certain relation types such as Acquire between two companies.
Jiang and Zhai proposed a framework to organize the features used
for relation extraction such that a systematic exploration of the feature
space can be conducted [37]. Specifically, a relation instance is represented
as a labeled, directed graph G = (V, E, A, B), where V is the set
of nodes in the graph, E is the set of directed edges in the graph, and
A and B are functions that assign labels to the nodes.
First, for each node v ∈ V, A(v) = {a1, a2, ..., a|A(v)|} is a set of attributes
associated with node v, where ai ∈ Σ, and Σ is an alphabet that
contains all possible attribute values. For example, if node v represents
a token, then A(v) can include the token itself, its morphological base
form, its part-of-speech tag, etc. If v also happens to be the head word
of arg1 or arg2, then A(v) can also include the entity type. Next, function
B : V → {0, 1, 2, 3} is introduced to distinguish argument nodes
from non-argument nodes. For each node v ∈ V , B(v) indicates how
node v is related to arg1 and arg2. 0 indicates that v does not cover any
argument, 1 or 2 indicates that v covers arg1 or arg2, respectively, and 3
indicates that v covers both arguments. In a constituency parse tree, a node v may represent a
phrase and it can possibly cover both arguments.
Figures 2.4, 2.5 and 2.6 show three relation instance graphs based on the
token sequence, the constituency parse tree and the dependency parse
tree, respectively.
An example sequence representation. The subgraph on the left represents
a bigram feature. The subgraph on the right represents a unigram feature that states the
entity type of arg2.
It can be shown that many features that have been explored in previous
work on relation extraction can be transformed into this graphic
representation. Figures 2.4, 2.5 and 2.6 show some examples.
This framework allows a systematic exploration of the feature space
for relation extraction. To explore the feature space, Jiang and Zhai considered
three levels of small unit features in increasing order of their complexity:
unigram features, bigram features and trigram features. They
found that a combination of features at different levels of complexity
and from different sentence representations, coupled with task-oriented
feature pruning, gave the best performance.
Kernel models for relation extraction include sequence kernels and composite sequence-based kernels.
Text clustering. Source: Kunwar 2013.
The amount of text data being generated has exploded in recent years. It's essential for
organizations to have a structure in place to mine actionable insights from the text being generated. From social
media analytics to risk management and cybercrime protection, dealing with textual data has never been more
important.
Text clustering is the task of grouping a set of unlabelled texts in such a way that texts in the same cluster are
more similar to each other than to those in other clusters. Text clustering algorithms process text and determine
if natural clusters (groups) exist in the data.
Discussion
The big idea is that documents can be represented numerically as vectors of features. The similarity in
text can be compared by measuring the distance between these feature vectors. Objects that are near
each other should belong to the same cluster. Objects that are far from each other should belong to
different clusters.
o Selecting a suitable distance measure to identify the proximity of two feature vectors.
o A criterion function that tells us when we've got the best possible clusters and can stop further processing.
o An algorithm to optimize the criterion function. A greedy algorithm will start with some initial clustering
and refine the clusters iteratively.
o Document Retrieval: To improve recall, start by adding other documents from the same cluster.
Classification is a supervised learning approach that maps an input to an output based on example input-
output pairs. Clustering is an unsupervised learning approach.
o Classification: If the prediction value tends to be category like yes/no or positive/negative, then it falls
under classification type problem in machine learning. The different classes are known in advance. For
example, given a sentence, predict whether it's a negative or positive review.
o Clustering: Clustering is the task of partitioning the dataset into groups called clusters. The goal is to
split up the data in such a way that points within single cluster are very similar and points in different
clusters are different. It determines grouping among unlabelled data.
o Hard Clustering: This groups items such that each item is assigned to only one cluster. For example, we
want to know if a tweet is expressing a positive or negative sentiment. k-means is a hard clustering
algorithm.
o Soft Clustering: Sometimes we don't need a binary answer. Soft clustering is about grouping items such
that an item can belong to multiple clusters. Fuzzy C Means (FCM) is a soft clustering algorithm.
o Text pre-processing: Text can be noisy, hiding information between stop words, inflexions and sparse
representations. Pre-processing makes the dataset easier to work with.
o Feature Extraction: One of the commonly used techniques to extract features from textual data is to
calculate the frequency of words/tokens in the document/corpus.
o Clustering: We can then cluster different text documents based on the features we have generated.
o Tokenization: Tokenization is the process of parsing text data into smaller units (tokens) such as words
and phrases.
o Transformation: It converts the text to lowercase, removes all diacritics/accents in the text, and parses
html tags.
o Normalization: Text normalization is the process of transforming a text into a canonical (root) form.
Stemming and lemmatization techniques are used for deriving the root word.
o Filtering: Stop words are common words used in a language, such as 'the', 'a', 'on', 'is', or 'all'. These
words do not carry important meaning for text clustering and are usually removed from texts.
o Document level: It serves to regroup documents about the same topic. Document clustering has
applications in news articles, emails, search engines, etc.
o Sentence level: It's used to cluster sentences derived from different documents. Tweet analysis is an
example.
o Word level: Word clusters are groups of words based on a common theme. The easiest way to build a
cluster is by collecting synonyms for a particular word. For example, WordNet is a lexical database for
the English language that groups English words into sets of synonyms called synsets.
In general, words can be used to represent a common class of feature. Word characteristics are also
features. For example, capitalization matters: US versus us, White House versus white house. Part of
speech and grammatical structure also add to textual features. Semantics can be a textual feature: buy
versus purchase.
The mapping from textual data to real-valued vectors is called feature extraction. One of the simplest
techniques to numerically represent text is Bag of Words (BOW). In BOW, we make a list of unique
words in the text corpus called vocabulary. Then we can represent each sentence or document as a
vector, with each word represented as 1 for presence and 0 for absence.
Another representation is to count the number of times each word appears in a document. The most
popular approach is using the Term Frequency-Inverse Document Frequency (TF-IDF) technique.
More recently, word embeddings are being used to map words into feature vectors. A popular model for
word embeddings is word2vec.
o Lexical similarity: Words are similar lexically if they have a similar character sequence. Lexical similarity
can be measured using string-based algorithms that operate on string sequences and character
composition.
o Semantic similarity: Words are similar semantically if they have the same meaning, are opposite of each
other, used in the same way, used in the same context or one is a type of another. Semantic similarity
can be measured using corpus-based or knowledge-based algorithms.
Some of the metrics for computing similarity between two pieces of text are Jaccard coefficient, cosine
similarity and Euclidean distance.
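A short sketch computing all three on a pair of toy sentences (Jaccard on word sets, cosine and Euclidean on TF-IDF vectors):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

a = "the president greets the press in Chicago"
b = "the president speaks to the media in Chicago"

# Jaccard coefficient on word sets (a lexical measure)
sa, sb = set(a.lower().split()), set(b.lower().split())
print("jaccard:", round(len(sa & sb) / len(sa | sb), 2))

# Cosine similarity and Euclidean distance on TF-IDF vectors
X = TfidfVectorizer().fit_transform([a, b])
print("cosine:", round(cosine_similarity(X[0], X[1])[0, 0], 2))
print("euclidean:", round(euclidean_distances(X[0], X[1])[0, 0], 2))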
Which are some common text clustering algorithms?
Some types of text clustering algorithms. Source: Khosla et al. 2019, fig. 4.
o Hierarchical: In the divisive approach, we start with one cluster and split that into sub-clusters. Example
algorithms include DIANA and MONA. In the agglomerative approach, each document starts as its own
cluster and then we merge similar ones into bigger clusters. Examples include BIRCH and CURE.
o Partitioning: k-means is a popular algorithm but requires the right choice of k. Other examples are
ISODATA and PAM.
o Density: Instead of using a distance measure, we form clusters based on how many data points fall
within a given radius. DBSCAN is the most well-known algorithm.
o Graph: Some algorithms have made use of knowledge graphs to assess document similarity. This
addresses the problem of polysemy (ambiguity) and synonymy (similar meaning).
o Probabilistic: A cluster of words belongs to a topic and the task is to identify these topics. Words also
have probabilities that they belong to a topic. Topic Modelling is a separate NLP task but it's similar to
soft clustering. pLSA and LDA are example topic models.
Internal quality measure: more compact clusters on the left. Source: Hassani and Seidl 2016.
Measuring the quality of a clustering algorithm has proven to be as important as the algorithm itself. We
can evaluate it in two ways:
o External quality measure: External knowledge is required for measuring the external quality. For
example, we can conduct surveys of users of the application that includes text clustering.
o Internal quality measure: The evaluation of the clustering is compared only with the result itself, that is,
the structure of found clusters and their relations to one another. Two main concepts are compactness
and separation. Compactness measures how closely data points are grouped in a
cluster. Separation measures how different the found clusters are from each other. More formally,
compactness is intra-cluster variance whereas separation is inter-cluster distance.
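One simple way to compute both quantities, sketched on toy 2-D data (compactness as mean squared distance to the centroid, separation as the distance between centroids):

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two visibly separated groups of points
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [5, 5], [5.1, 4.9], [4.8, 5.2]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

compactness = km.inertia_ / len(X)  # intra-cluster variance (lower is tighter)
separation = np.linalg.norm(km.cluster_centers_[0] - km.cluster_centers_[1])

print(round(compactness, 3), round(separation, 3))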
Document clustering has been studied for many decades. It's far from trivial or a solved problem. The
challenges include the following:
o Implementing the clustering algorithm in an efficient way that makes it feasible in terms of memory
and CPU resources.
Text mining research in general relies on a vector space model. Salton first proposes it to model text documents
as vectors. Features are considered to be the words in the document collection and feature values come from
different term weighting schemes, the most popular of which is the Term Frequency-Inverse Document
Frequency (TF-IDF).
1983
Massart et al. in the book The Interpretation of Analytical Chemical Data by the Use of Cluster
Analysis introduce various clustering methods, including hierarchical and non-hierarchical methods. They
show how clustering can be used to interpret large quantities of analytical data. They discuss how clustering is
related to other pattern recognition techniques.
1992
Cutting et al. adapt partition-based clustering algorithms to cluster documents. Two of the techniques are
Buckshot and Fractionation. Buckshot selects a small sample of documents to pre-cluster them using a standard
clustering algorithm and assigns the rest of the documents to the clusters formed. Fractionation finds k centres
by initially breaking N documents into N/m buckets of a fixed size m > k. Each cluster is then treated as if it's
an individual document and the whole process is repeated until there are only K clusters.
1997
Huang introduces k-modes, an extension to the well-known k-means algorithm for clustering numerical data.
By defining the mode notion for categorical clusters and introducing an incremental update rule for cluster
modes, the algorithm preserves the scaling properties of k-means. Naturally, it also inherits its disadvantages,
such as dependence on the seed clusters and the inability to automatically detect the number of clusters.
2008
Sun et al. develop a novel hierarchical algorithm for document clustering. They use cluster overlapping
phenomenon to design cluster merging criteria. The system computes the overlap rate in order to improve time
efficiency.
This also means that for classification the correct number of groups is
known, whereas in clustering there is no such number. Note that it is
not just unknown — it simply does not exist. It is up to us to choose a
suitable number of clusters for our purpose. Many times, this means
trying out a few and then choosing the one which delivered the best
results.
Kinds of clustering methods
Before we dive right into concrete clustering algorithms, let us first
establish some ways in which we can describe and distinguish them.
There are a few ways in which this is possible:
Clustering words
Congratulations! You have made it past the introduction. In the next
few paragraphs, we will look at clustering methods for words. Let’s
look at the following set of them:
When talking about words with similar meaning, you often read about
the distributional hypothesis in linguistics. This hypothesis states
that words bearing a similar meaning will appear in similar
word contexts. You could say "The box is on the shelf.", but also "The
box is under the shelf.” and still produce a meaningful
sentence. On and under are interchangeable up to a certain extent.
This hypothesis is utilized when creating word embeddings. Word
embeddings map each word of a vocabulary onto a n-dimensional
vector space. Words that have similar contexts will appear roughly in
the same area of the vector space. One of these embeddings was
developed by Weston, Ratle & Collobert in 2008. You can see an
interesting segment of the word vectors (reduced to two dimensions
with t-SNE) here:
Source: Joseph Turian; see the full picture in the original post.
Notice how neatly months, names and locations are grouped together.
This will come in handy for clustering them in the next step. To learn
more about how exactly word embeddings are created and the
interesting properties they have, take a look at the Medium article by
Hunter Heidenreich. It also includes information about more advanced word embeddings
like word2vec.
k-means
We will now look at the most famous vector-based clustering algorithm
out there: k-means. What k-means does is return a cluster
assignment to one of k possible clusters for each object. To
recapitulate what we learned earlier, it is a hard, flat
clustering method. Let's see what the k-means process looks like:
K-means is not the only vector-based clustering method out there.
Other frequently used methods include DBSCAN, a method favoring
densely populated clusters, and expectation maximization (EM), a
method that assumes an underlying probability distribution for each
cluster.
Brown clustering
There are also methods for clustering words that do not require the
words to already be available as vectors. Probably the most cited such
technique is Brown clustering, proposed in 1992 by Brown et al. (not
related to the Brown corpus, which is named after Brown University,
Rhode Island).
You can also look at small sub-trees and find clusters that contain
word pairs close to synonymity such
as evaluation and assessment or conversation and discussion.
Clustering documents
In general, clustering documents can also be done by looking at each
document in vector format. But documents rarely have contexts. You
could imagine a book standing next to other books in a tidy shelf, but
usually this is not what large collections of digital documents (so-
called corpora) look like.
tf-idf
Look at the following toy example containing only two short
documents d1 and d2 and the resulting bag of words vectors:
What you can see is that words that are not very specific
like I and love get rewarded with the same value as the words actually
discerning the two documents like pizza and chocolates. A way to
counteract this behavior is to use tf-idf, a numerical statistic used as a
weighting factor dampening the effects of less important words.
Tf-idf stands for term frequency and inverse document frequency, the
two factors used for weighting. The term frequency is simply the
number of occurrences of a word in a specific document. If our
document is “I love chocolates and chocolates love me”, the term
frequency of the word love would be two. This value is often
normalized by dividing it by the highest term frequency in the given
document, resulting in term frequency values between 0 (for words
not appearing in the document) and 1 (for the most frequent word in
the document). The term frequencies are calculated per word and
document.
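A tiny sketch of that computation for the example document:

from collections import Counter

doc = "I love chocolates and chocolates love me".lower().split()
counts = Counter(doc)

max_tf = max(counts.values())                        # highest raw count in the document
tf = {word: n / max_tf for word, n in counts.items()}

print(counts["love"])  # raw term frequency of "love": 2
print(tf["love"])      # normalized: 2 / 2 = 1.0
print(tf["me"])        # 1 / 2 = 0.5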
Based on these formulas, we get the following values for our toy
example:
Looking at the last two columns, we see that only the most relevant
words receive a high tf-idf value. So-called stop words, meaning
words that are ubiquitous in our document collection, get a value of,
or close to, 0.
The resulting tf-idf vectors are still as high-dimensional as the original
bag of words vectors. Therefore, dimensionality reduction techniques
such as latent semantic indexing (LSI) are often used to make them
easier to handle. Algorithms such as k-means, DBSCAN and EM can be
used on document vectors, too, just as described earlier for word
clustering. Possible distance measures include Euclidean and cosine
distance.
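A brief sketch of this combination in scikit-learn, where LSI is performed with a truncated SVD on the tf-idf matrix before clustering (toy documents):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["I love pizza", "I love chocolates",
        "the stock market fell today", "stock prices fell sharply"]

X = TfidfVectorizer().fit_transform(docs)                       # high-dimensional tf-idf
X_lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsi))
# the two food documents land in one cluster, the two stock-market
# documents in the other (the label ids themselves are arbitrary)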
But this is not all that LDA provides us with. Additionally, it tells us for
each document which topics appear in it and to which percentage. For
example, an article about a new device which can detect a specific
disease prevalence in your DNA may consist of the topic mix
48% Disease, 31% Genetics and 21% Computers.
Note that even though a pizza emoji is much less likely to be drawn
from topic 3, it is still possible to originate from it. Now that we have
the result we want, we only have to find a way to reverse this process.
Only? In reality, we are faced with this:
In theory, we could make our computer try out every possible
combination of words and topics. Besides the fact that this would
probably take an eternity, how do we know in the end which
combination makes sense and which one doesn’t? For this, the
Dirichlet distribution comes in handy. Instead of drawing the
distribution like in the image above, let’s draw our document on a
topic simplex in the respective position.
We can also draw a lot of other documents from our corpus next to
this one. It could look like this:
Or, if we had chosen other topics beforehand, like this:
While in the first variant the documents are clearly discernible and
emphasize different topics, the documents in the second variant are all
more or less alike. The topics chosen here were not able to separate
the documents in a meaningful way. These two possible document
distributions are nothing else than two different Dirichlet
distributions! This means we have found a way of describing what
“good” distributions over topics look like!
The same principle applies to words in topics. Good topics will have
different distributions over words, while bad topics will have about the
same words as others. These two appearances of Dirichlet
distributions are described in the LDA model by two hyperparameters,
alpha and beta. As a general rule, you want to keep your Dirichlet
parameters below one. Take a look at how different values change the
distribution in this brilliant animation made by David Lettier.
If you want to try out what LDA does to a data set of your choice
interactively, click your way through this great in-browser demo by
David Mimno.
4. Word to Vectors
Code snippet
The sklearn library will do all the computation for us in just a single line of
code. Using sklearn we can control the vocabulary size, remove stopwords, choose
the scoring method (binary or count), and, when building the vocabulary, ignore
terms that have a document frequency higher or lower than a given
threshold. Have a look at sklearn's detailed documentation for more.
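A hedged sketch with CountVectorizer on a toy three-sentence movie-review corpus (the companion documents to the review used below are assumed):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This movie is very scary and long",
          "This movie is not scary and is slow",
          "This movie is spooky and good"]

# vocabulary size (max_features), stop words (stop_words), binary vs. count
# scoring (binary) and document-frequency thresholds (max_df, min_df) are all
# constructor arguments.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())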
The scores are a weighting in which not all words are equally
important or interesting. The scores have the effect of highlighting
words that are distinct (contain useful information) in a given
document. Rare words like 'supine' or 'idyllic' get scored highly
compared to common words like 'the', 'is' and 'and'.
Below we will see how to compute tf-idf for document “This movie is
not scary and is slow”
Similarly, we compute tf-idf values for all the words in the document, and
the final tf-idf vector for document 2 looks as below.
Code snippet
Like CountVectorizer, TfidfVectorizer can handle text
preprocessing steps such as stopword removal, lowercasing, vocabulary
size, etc.
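A minimal TfidfVectorizer sketch on the same assumed three-document corpus; the second row is sklearn's tf-idf vector for document 2:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This movie is very scary and long",
          "This movie is not scary and is slow",
          "This movie is spooky and good"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(X.toarray()[1].round(2))  # tf-idf vector for document 2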
You may notice that the tf-idf vector we derived by hand for document 2
differs from the one produced by sklearn's TfidfVectorizer.
Different Notation
TF remains the same, while IDF is different: "some constant 1 is added to the
numerator and denominator of the IDF, as if an extra document was seen
containing every term in the collection exactly once, which prevents zero
divisions"; this is a more empirical approach, while the standard notation
present in most textbooks doesn't have the constant 1. For a more detailed
explanation and step-by-step derivation, have a look at the original blog post.
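A short check of the two notations (the smoothed variant below is what scikit-learn uses by default with smooth_idf=True; the document frequencies are computed by hand for the same toy corpus):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This movie is very scary and long",
          "This movie is not scary and is slow",
          "This movie is spooky and good"]

tfidf = TfidfVectorizer().fit(corpus)
terms = tfidf.get_feature_names_out()

n = len(corpus)
df = np.array([sum(term in doc.lower().split() for doc in corpus) for term in terms])

textbook_idf = np.log(n / df)                   # standard textbook notation
sklearn_idf = np.log((1 + n) / (1 + df)) + 1    # smoothed variant with the constant 1

print(dict(zip(terms, sklearn_idf.round(3))))   # should match tfidf.idf_
print(dict(zip(terms, textbook_idf.round(3))))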
Keras offers an Embedding layer that can be used for neural networks
on text data. It requires that the input data be integer encoded, so that
each word is represented by a unique integer. The Embedding layer is
initialized with random weights and will learn an embedding for all of
the words in the training dataset.
You can copy paste the code snippet in your python notebook, modify
the input shape and Embedding layer’s input_length argument and
play with it for better understanding. For lot more examples
click here.
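A small, hedged sketch of the layer in use (toy integer-encoded documents and an invented vocabulary size of 50):

import numpy as np
import tensorflow as tf

# Four "documents", already integer encoded and padded to length 4
docs = np.array([[4, 12, 7, 0],
                 [4, 9, 0, 0],
                 [31, 12, 7, 2],
                 [8, 1, 0, 0]])

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=50, output_dim=8),  # learns an 8-d vector per word
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

print(model(docs).shape)  # (4, 1): one prediction per document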
4. Word to Vectors
The code snippet shows that when we subtract the vector for man from
the vector for king and add the vector for woman, the closest word vector
we get is queen.
king - man + woman = queen. The "man-ness" in king is replaced
with "woman-ness" to give us queen.
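That analogy can be reproduced with gensim and pretrained word2vec vectors (the model below is the ~1.6 GB Google News model from gensim-data, downloaded on first use):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pretrained word2vec vectors

# vector('king') - vector('man') + vector('woman') is closest to vector('queen')
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))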
In data mining and statistics, hierarchical cluster analysis is a method of
cluster analysis that seeks to build a hierarchy of clusters, i.e. a tree-like
structure of nested clusters.
In machine learning, clustering is an unsupervised learning technique that
groups data based on the similarity between data points. There are several
types of clustering algorithms in machine learning.
Connectivity-based clustering: This type of clustering algorithm builds the cluster based on
the connectivity between the data points. Example: Hierarchical clustering
Centroid-based clustering: This type of clustering algorithm forms
around the centroids of the data points. Example: K-Means clustering, K-
Mode clustering
Distribution-based clustering: This type of clustering algorithm is
modeled using statistical distributions. It assumes that the data points in
a cluster are generated from a particular probability distribution, and the
algorithm aims to estimate the parameters of the distribution to group
similar data points into clusters Example: Gaussian Mixture Models
(GMM)
Density-based clustering: This type of clustering algorithm groups
together data points that are in high-density regions and
separates points in low-density regions. The basic idea is that it
identifies regions in the data space that have a high density of data
points and groups those points together into clusters.
Example: DBSCAN(Density-Based Spatial Clustering of Applications with
Noise)
In this article, we will discuss connectivity-based clustering algorithms, i.e.
hierarchical clustering.
Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering model that groups the
data points together that are close to each other based on the measure of
similarity or distance. The assumption is that data points that are close to each
other are more similar or related than data points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the
hierarchical relationships between groups. Individual data points are located at
the bottom of the dendrogram, while the largest clusters, which include all the
data points, are located at the top. In order to generate different numbers of
clusters, the dendrogram can be sliced at various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a
measure of similarity or distance between data points. Clusters are divided or
merged repeatedly until all data points are contained within a single cluster, or
until the predetermined number of clusters is attained.
We can look at the dendrogram and measure the height at which the branches
of the dendrogram form distinct clusters to calculate the ideal number of
clusters. The dendrogram can be sliced at this height to determine the number
of clusters.
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC). It produces a structure that is more informative than the unstructured set
of clusters returned by flat clustering. This clustering algorithm does not require
us to prespecify the number of clusters. Bottom-up algorithms treat each data point as
a singleton cluster at the outset and then successively agglomerate pairs of
clusters until all clusters have been merged into a single cluster that contains all the
data.
Algorithm :
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i=1 to N:
   # as the distance matrix is symmetric about
   # the primary diagonal, we compute only the lower
   # part of the primary diagonal
   for j=1 to i:
      dis_mat[i][j] = distance[di, dj]
each data point is a singleton cluster
repeat
   merge the two clusters having minimum distance
   update the distance matrix
until only a single cluster remains
Hierarchical Agglomerative Clustering
Steps:
Consider each alphabet as a single cluster and calculate the distance of
one cluster from all the other clusters.
In the second step, comparable clusters are merged together to form a
single cluster. Say cluster (B) and cluster (C) are very similar to each
other, so we merge them; similarly for clusters (D) and (E). We are left
with the clusters [(A), (BC), (DE), (F)].
We recalculate the proximity according to the algorithm and merge the
two nearest clusters ((DE) and (F)) together to form the new clusters
[(A), (BC), (DEF)].
Repeating the same process, the clusters DEF and BC are comparable
and merged together to form a new cluster. We're now left with clusters
[(A), (BCDEF)].
At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
Python implementation of the above algorithm using the scikit-learn library:
Python3
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])  # toy data (assumed)
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print(clustering.labels_)
Output :
[1, 1, 1, 0, 0, 0]
Hierarchical Divisive clustering
It is also known as the top-down approach. This algorithm also does not require us to
prespecify the number of clusters. Top-down clustering requires a method for
splitting a cluster that contains the whole data and proceeds by splitting clusters
recursively until individual data have been split into singleton clusters.
Algorithm :
given a dataset (d1, d2, d3, ..., dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. k-means
repeat
choose the best cluster among all the clusters to split
split that cluster by the flat clustering algorithm
until each data is in its own singleton cluster
Hierarchical Divisive clustering
Implementation code
Python3
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# toy data (assumed): the same points as in the agglomerative example
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

Z = linkage(X, 'ward')

# Plot dendrogram
dendrogram(Z)
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()
Output:
1. LDA
2. Hyperparameters in LDA
3. LDA in Python
4. Shortcomings of LDA
5. Alternative
LDA in Python
Let us look at an implementation of LDA. We will try to extract topics
from a set of reviews.
dataset
Feature Extraction:
Vectorization:
Since LDA works on raw term counts rather than weighted scores, we will use a Count
vectorizer rather than a TF-IDF vectorizer.
Latent Dirichlet Allocation:
Topic Assignments:
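A compact sketch of these steps with scikit-learn; the review corpus below is invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "battery life is great and the screen is bright",
    "terrible battery, the phone dies in hours",
    "delivery was fast and the packaging was neat",
    "late delivery and damaged packaging",
]

# Vectorization: LDA works on raw term counts
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)

# Latent Dirichlet Allocation with two topics
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words per topic
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in topic.argsort()[::-1][:4]])

# Topic assignments: the topic mixture for each review
print(lda.transform(X).round(2))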
Shortcomings of LDA:
1. LDA performs poorly on short texts, and most of our data was
short.
2. Since the reviews are not coherent, LDA finds it all the
more difficult to identify the topics.
3. Since the reviews are mainly context-dependent, models based on
word co-occurrence fail.