Unsupervised-learning approaches: HyperLex
The HyperLex algorithm is an unsupervised approach for Word Sense
Disambiguation (WSD) that operates without a predefined sense inventory or
labelled data. Instead, it induces word senses directly from a large text corpus by
leveraging graph theory and the "small-world" property of word co-occurrence
networks.
Core principles
The central idea behind HyperLex is that different senses of a word tend to co-occur
with different sets of related words. These co-occurrences form distinct, highly
interconnected clusters within a larger co-occurrence graph.
Small-world graphs: Word co-occurrence graphs exhibit "small-world" properties: although the overall graph is very large and sparse, any node can be reached from any other node via a short path, and neighbourhoods are highly clustered. It is this local clustering that HyperLex exploits.
Highly connected components: Within this small-world graph, the different senses
of an ambiguous word appear as tightly interconnected "bundles" of co-occurring
words, also known as high-density components.
Hubs: The most central and highly connected words within these high-density
components are called "hubs." These hubs act as prototypes for each distinct word
sense.
The HyperLex process
1. Corpus selection: First, a sub-corpus is extracted containing all paragraphs or
sentences where the target ambiguous word appears.
2. Graph construction: A co-occurrence graph is built for the target word using this
sub-corpus.
1. Nodes: The nodes of the graph represent the content words (nouns, verbs,
adjectives) that co-occur with the target word within the context window (e.g., a
paragraph).
2. Edges: An edge is drawn between two words if they co-occur in the same
paragraph. In the original algorithm the weight of the edge between words A and B
is w = 1 − max[P(A | B), P(B | A)], so it behaves like a distance: strongly
associated pairs receive weights close to 0, weakly associated pairs weights
close to 1.
3. Hub detection: The algorithm identifies the high-density, highly-connected
components within the graph. The most central nodes within these components are
designated as "hubs". These hubs represent the prototypes for each of the target
word's senses.
4. Disambiguation: To disambiguate a new instance of the target word, its context
words are compared to the known hubs. The sense associated with the closest hub
(or the most similar hub-component) is assigned to the word.
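The graph construction and weighting in step 2 can be sketched in a few lines of Python. The snippet below computes edge weights of the form w = 1 − max(P(a|b), P(b|a)) over a list of tokenized contexts; the toy contexts and the counting details (no window, no part-of-speech filtering) are simplifying assumptions, not the full algorithm.

```python
from collections import Counter
from itertools import combinations

def edge_weights(contexts):
    """Weighted co-occurrence edges: w = 1 - max(P(a|b), P(b|a)).
    A weight near 0 means a strong association, near 1 a weak one."""
    word_freq = Counter()   # number of contexts containing each word
    pair_freq = Counter()   # number of contexts containing each word pair
    for words in contexts:
        unique = sorted(set(words))
        word_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return {
        (a, b): 1 - max(pair_freq[a, b] / word_freq[b],
                        pair_freq[a, b] / word_freq[a])
        for (a, b) in pair_freq
    }
```

Pairs that always appear together get weight 0, so the weight can be used as a distance when searching the graph for high-density components.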
Strengths and weaknesses
Strengths
o No labeled data required: Because it is an unsupervised method, HyperLex does
not require any sense-tagged training data.
o Corpus-based sense induction: The senses are induced directly from the corpus,
making them specific to the domain of the text and more flexible than fixed-sense
inventories from lexicons like WordNet.
o Handles rare senses: It is capable of isolating and identifying infrequent word
uses by detecting hubs and high-density components, something that earlier
word-vector methods struggled with.
o Effective for information retrieval: HyperLex was originally developed for
information retrieval and showed excellent performance in identifying relevant
contexts for ambiguous query words.
Weaknesses
o Parameter sensitivity: The algorithm's performance is heavily influenced by a
set of heuristic parameters, such as the context window size and minimum
co-occurrence frequency.
o Limited granularity: While it is effective at distinguishing coarse-grained,
polysemous uses, HyperLex may struggle to differentiate very fine-grained word
senses.
o Heuristic limitations: Finding hubs and high-density components in large graphs
is an NP-hard problem in general, so HyperLex relies on approximate algorithms
and heuristics.
o Complexity of interpretation: Because it does not use a predefined sense
inventory, the "senses" or "uses" discovered by HyperLex are simply clusters of
co-occurring words organized around hubs. Mapping these induced senses to
standard, human-understandable senses requires a separate step.
Example
For a practical example of the HyperLex algorithm, let's consider the ambiguous
word "bank" using a large, raw text corpus. The algorithm will induce its different
senses without any prior knowledge or human labeling.
Step 1: Sub-corpus extraction
First, we collect all paragraphs or sentences from a large corpus where the word
"bank" appears.
Example paragraphs:
o Context 1: "He walked along the river bank and watched the boats sail by."
o Context 2: "She deposited her savings at the local bank."
o Context 3: "The company took a loan from the investment bank."
o Context 4: "Birds nested in the mud bank after the flood receded."
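Step 1 amounts to filtering the corpus for paragraphs that mention the target word. A minimal sketch (the whitespace tokenizer and punctuation stripping are simplifying assumptions):

```python
def extract_subcorpus(paragraphs, target):
    """Keep only (tokenized) paragraphs that contain the target word."""
    subcorpus = []
    for paragraph in paragraphs:
        tokens = [w.strip('.,"').lower() for w in paragraph.split()]
        if target in tokens:
            subcorpus.append(tokens)
    return subcorpus

corpus = [
    "He walked along the river bank and watched the boats sail by.",
    "She deposited her savings at the local bank.",
    "The weather was sunny all week.",
]
```

Here extract_subcorpus(corpus, "bank") keeps the first two paragraphs and drops the third.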
Step 2: Co-occurrence graph construction
A co-occurrence graph is built using the content words found in the sub-corpus,
excluding the target word "bank".
Nodes: The nodes of the graph would be content words like: river, boats, sail,
savings, deposited, local, company, loan, investment, birds, nested, mud, flood.
Edges: Edges connect words that co-occur within the same paragraph. The weight
of the edge indicates the strength of the relationship. Stronger, less frequent co-
occurrences are given more weight.
o Edges in Sense A (river): (river, boats), (river, sail), (boats, sail).
o Edges in Sense B (financial): (deposited, savings), (savings, local),
(company, loan), (loan, investment).
o Edges in Sense C (mud): (birds, nested), (birds, mud), (nested, flood).
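The nodes and edges above can be built mechanically. In the sketch below the four contexts are already reduced to content words by hand (stop-word removal and the exclusion of "bank" are assumed to have happened), and an edge simply records co-occurrence within a context:

```python
from itertools import combinations

# content words per context, target word "bank" excluded (hand-filtered)
contexts = [
    ["river", "boats", "sail"],            # Context 1
    ["deposited", "savings", "local"],     # Context 2
    ["company", "loan", "investment"],     # Context 3
    ["birds", "nested", "mud", "flood"],   # Context 4
]

nodes = sorted({word for context in contexts for word in context})
edges = sorted({pair for context in contexts
                for pair in combinations(sorted(context), 2)})
```

Each context yields a small clique; with more text, the cliques belonging to the same sense overlap and merge into a dense component.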
Step 3: Hub detection
The algorithm analyzes the graph to find high-density components or clusters. The
most central, highly connected nodes within these clusters are the "hubs."
Root Hubs detected:
o Hub for Sense A (river): river
o Hub for Sense B (financial): deposited
o Hub for Sense C (mud): mud
o The induced senses would be represented by these hubs and their surrounding
clusters of co-occurring words.
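Hub detection can be approximated by ranking nodes by degree and greedily keeping only nodes that are not neighbours of an already chosen hub, so each hub comes from a different dense region. The original algorithm instead selects root hubs iteratively and removes each hub's neighbourhood from the graph; this greedy sketch over hand-made toy edges just illustrates the idea.

```python
from collections import defaultdict

# toy edges: three small clusters standing in for the example's three senses
edges = [
    ("river", "boats"), ("river", "sail"), ("boats", "sail"),
    ("deposited", "savings"), ("deposited", "loan"), ("savings", "loan"),
    ("mud", "birds"), ("mud", "nested"), ("birds", "nested"), ("mud", "flood"),
]
edge_set = set(edges)

degree = defaultdict(int)
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

hubs = []
for node in sorted(degree, key=degree.get, reverse=True):
    # keep a node only if it is not a neighbour of an existing hub
    if all((node, h) not in edge_set and (h, node) not in edge_set for h in hubs):
        hubs.append(node)
```

In this toy graph mud surfaces first only because it has the highest degree; with a real corpus the ranking would reflect actual co-occurrence counts.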
Step 4: Disambiguation
To disambiguate a new sentence, its context words are compared to the induced
hubs. The sentence is assigned the sense corresponding to the closest-matching
hub.
New Sentence: "The investor secured a loan from the bank."
Disambiguation process:
o The context words are investor, secured, and loan.
o The algorithm compares these words with the induced hubs: river, deposited,
and mud.
o The word loan shows a strong co-occurrence relation with the deposited hub's
cluster, which represents the financial sense.
Result: The algorithm assigns the financial institution sense to "bank" in this
sentence because its context aligns with the deposited hub cluster.
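The scoring in step 4 can be sketched as word overlap between the context and each hub's cluster; the cluster contents below are hand-coded from the example, not learned.

```python
# induced senses: each root hub with its surrounding cluster of co-occurring words
senses = {
    "river":     {"river", "boats", "sail"},
    "deposited": {"deposited", "savings", "loan", "investment"},
    "mud":       {"mud", "birds", "nested", "flood"},
}

def disambiguate(context_words, senses):
    """Assign the hub whose cluster shares the most words with the context."""
    return max(senses, key=lambda hub: len(senses[hub] & set(context_words)))
```

For the sentence above, the context {investor, secured, loan} overlaps only with the deposited cluster, so the financial sense is chosen.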
Step 5: Interpretation (manual)
Since HyperLex operates without a dictionary, a human would need to inspect the
discovered "senses" and their corresponding hubs to understand their meaning.
Sense A (Hub: river): This cluster of words (river, boats, sail) corresponds
to the "river bank" sense.
Sense B (Hub: deposited): This cluster (deposited, savings, loan, investment)
corresponds to the "financial institution" sense.
Sense C (Hub: mud): This cluster (mud, birds, nested, flood) corresponds to
the less frequent "mud bank" or "sand bank" sense. This demonstrates HyperLex's
ability to find infrequent senses.