Unsupervised-learning approaches: HyperLex

The HyperLex algorithm is an unsupervised approach to Word Sense
Disambiguation (WSD) that operates without a predefined sense inventory or
labeled data. Instead, it induces word senses directly from a large text corpus by
leveraging graph theory and the "small-world" property of word co-occurrence
networks.

Core principles
The central idea behind HyperLex is that different senses of a word tend to co-occur
with different sets of related words. These co-occurrences form distinct, highly
interconnected clusters within a larger co-occurrence graph.
 Small-world graphs: Word co-occurrence graphs exhibit "small-world" properties:
although the overall graph is very large, any node can be reached from any other
via a short path, and local neighborhoods are densely clustered.

 Highly connected components: Within this small-world graph, the different senses
of an ambiguous word appear as tightly interconnected "bundles" of co-occurring
words, also known as high-density components.

 Hubs: The most central and highly connected words within these high-density
components are called "hubs." These hubs act as prototypes for each distinct word
sense.
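
To make the small-world picture concrete, the two signature statistics (high local clustering, short average paths) can be checked with networkx. The toy edge list below is invented purely for illustration and is not part of HyperLex:

```python
import networkx as nx

# Invented toy co-occurrence graph: two dense word clusters joined by a
# single bridge edge, mimicking small-world structure in miniature.
G = nx.Graph([
    ("river", "boats"), ("river", "sail"), ("boats", "sail"),
    ("deposited", "savings"), ("deposited", "loan"), ("savings", "loan"),
    ("sail", "loan"),  # one long-range "shortcut" between the clusters
])

# Small-world graphs pair high local clustering with short global paths.
print("average clustering:", round(nx.average_clustering(G), 2))
print("average shortest path:", round(nx.average_shortest_path_length(G), 2))
```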

The HyperLex process


1. Corpus selection: First, a sub-corpus is extracted containing all paragraphs or
sentences where the target ambiguous word appears.

2. Graph construction: A co-occurrence graph is built for the target word using this
sub-corpus.

o Nodes: The nodes of the graph represent the content words (nouns, verbs,
adjectives) that co-occur with the target word within the context window (e.g., a
paragraph).

o Edges: An edge is drawn between two words if they co-occur in the same
paragraph. Each edge carries a weight reflecting the strength of the association;
in Véronis's original formulation the weight is a distance, 1 - max(p(a|b), p(b|a)),
so frequently and specifically co-occurring pairs receive low weights (see the
sketch after this list).

3. Hub detection: The algorithm identifies the high-density, highly connected
components within the graph. The most central nodes within these components are
designated as "hubs". These hubs represent the prototypes for each of the target
word's senses.

4. Disambiguation: To disambiguate a new instance of the target word, its context
words are compared to the known hubs. The sense associated with the closest hub
(or the most similar hub component) is assigned to the word.
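
To make step 2's weighting concrete, here is a sketch of the distance formula as described in Véronis's HyperLex paper, w(a, b) = 1 - max(p(a|b), p(b|a)); the counts are invented for illustration:

```python
# Edge weight as a *distance*: strongly associated pairs end up near 0.
def edge_weight(cooc_ab, freq_a, freq_b):
    """w(a, b) = 1 - max(p(a|b), p(b|a)), with p(a|b) = cooc_ab / freq_b."""
    return 1 - max(cooc_ab / freq_b, cooc_ab / freq_a)

# Invented counts: word a appears 100 times, word b 400 times.
print(edge_weight(cooc_ab=80, freq_a=100, freq_b=400))  # 0.2  -> close
print(edge_weight(cooc_ab=5,  freq_a=100, freq_b=400))  # 0.95 -> distant
```
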
Strengths and weaknesses

Strengths:

 No labeled data required: Because it is an unsupervised method, HyperLex does
not require any sense-tagged training data.

 Corpus-based sense induction: The senses are induced directly from the corpus,
making them specific to the domain of the text and more flexible than fixed sense
inventories from lexicons like WordNet.

 Handles rare senses: It can isolate and identify infrequent word uses by detecting
hubs and high-density components, something that earlier word-vector methods
struggled with.

 Effective for information retrieval: HyperLex was originally developed for
information retrieval and showed excellent performance in identifying relevant
contexts for ambiguous query words.

Weaknesses:

 Parameter sensitivity: The algorithm's performance is heavily influenced by a set
of heuristic parameters, such as the context window size and the minimum
co-occurrence frequency.

 Limited granularity: While it is effective at distinguishing coarse-grained,
polysemous uses, HyperLex may struggle to differentiate very fine-grained word
senses.

 Heuristic limitations: Detecting hubs and high-density components in large graphs
is an NP-hard problem, so HyperLex relies on approximations and heuristics.

 Complexity of interpretation: Since it doesn't use a predefined sense inventory,
the "senses" or "uses" discovered by HyperLex are simply clusters of co-occurring
words around hubs. Mapping these induced senses to standard, human-readable
senses requires a separate step.

Example
For a practical example of the HyperLex algorithm, let's consider the ambiguous
word "bank" using a large, raw text corpus. The algorithm will induce its different
senses without any prior knowledge or human labeling.

Step 1: Sub-corpus extraction


 First, we collect all paragraphs or sentences from a large corpus where the word
"bank" appears.

 Example paragraphs:

o Context 1: "He walked along the river bank and watched the boats sail by."
o Context 2: "She deposited her savings at the local bank."

o Context 3: "The company took a loan from the investment bank."

o Context 4: "Birds nested in the mud bank after the flood receded."
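
A minimal sketch of this extraction step, assuming the corpus is simply a list of paragraph strings (the tokenize helper and the distractor sentence are our own illustrative additions):

```python
import re

def tokenize(text):
    """Lowercase a paragraph and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# The four example contexts above, plus one distractor paragraph.
corpus = [
    "He walked along the river bank and watched the boats sail by.",
    "She deposited her savings at the local bank.",
    "The company took a loan from the investment bank.",
    "Birds nested in the mud bank after the flood receded.",
    "The weather was pleasant that afternoon.",  # no "bank": filtered out
]

target = "bank"
sub_corpus = [p for p in corpus if target in tokenize(p)]
print(len(sub_corpus))  # 4
```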

Step 2: Co-occurrence graph construction


 A co-occurrence graph is built using the content words found in the sub-corpus,
excluding the target word "bank" itself.

 Nodes: The nodes of the graph would be content words such as: river, boats,
sail, savings, deposited, local, company, loan, investment, birds, nested, mud,
flood.

 Edges: Edges connect words that co-occur within the same paragraph, and each
edge carries a weight indicating the strength of the relationship; with the distance
weighting described earlier, strongly associated pairs receive weights near zero.

o Edges in Sense A (river): (river, boats), (river, sail), (boats, sail).

o Edges in Sense B (financial): (deposited, savings), (deposited, local),
(savings, local) from Context 2, and (company, loan), (company, investment),
(loan, investment) from Context 3.

o Edges in Sense C (mud): (birds, nested), (birds, mud), (nested, flood).
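
Continuing the sketch, the graph can be built with networkx from the sub_corpus of step 1. The stoplist below is an ad-hoc stand-in for real part-of-speech filtering, tuned simply to reproduce the node list given above:

```python
from itertools import combinations
import networkx as nx

# Function words plus a few verbs are filtered out; the target word
# "bank" itself is also excluded from the graph, as described above.
STOP = {"he", "she", "her", "the", "a", "an", "at", "in", "by", "and",
        "from", "after", "walked", "along", "watched", "took", "receded",
        "bank"}

G = nx.Graph()
for para in sub_corpus:
    words = sorted(set(tokenize(para)) - STOP)
    for a, b in combinations(words, 2):
        # Raw co-occurrence counts; distance weights such as the
        # 1 - max(p(a|b), p(b|a)) sketch earlier can be derived from them.
        if G.has_edge(a, b):
            G[a][b]["count"] += 1
        else:
            G.add_edge(a, b, count=1)

print(sorted(G.nodes))  # birds, boats, company, deposited, flood, ...
```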

Step 3: Hub detection


 The algorithm analyzes the graph to find high-density components or clusters. The
most central, highly connected nodes within these clusters are the "hubs."

 Root Hubs detected:

o Hub for Sense A (river): river

o Hub for Sense B (financial): deposited

o Hub for Sense C (mud): mud

o The induced senses would be represented by these hubs and their surrounding
clusters of co-occurring words.
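
Exact detection of the densest components is intractable (see the weaknesses above), so greedy heuristics are used. The sketch below, continuing from the graph G of step 2, loosely follows Véronis's procedure: promote the best-connected remaining node to a root hub, then delete it and its neighbors so the next hub comes from a different region of the graph. The frequency and edge-weight thresholds of the real algorithm are omitted. With a graph this tiny, degree ties make the exact hub word arbitrary (boats may win over river), and the two financial contexts surface as separate hubs because they share no words; a realistic corpus would merge them:

```python
H = G.copy()
hubs = {}
while H.number_of_nodes() > 0:
    node = max(H.nodes, key=H.degree)   # best-connected remaining node
    if H.degree(node) < 2:              # too weakly connected to be a hub
        break
    hubs[node] = set(H.neighbors(node))
    H.remove_nodes_from({node} | hubs[node])

for hub, cluster in hubs.items():
    print(hub, "->", sorted(cluster))  # e.g. birds -> ['flood', 'mud', 'nested']
```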

Step 4: Disambiguation
 To disambiguate a new sentence, its context words are compared to the induced
hubs. The sentence is assigned the sense corresponding to the closest-matching
hub.

 New Sentence: "The investor secured a loan from the bank."

 Disambiguation process:

o The context words are investor, secured, and loan.

o The algorithm compares these words with the induced hubs: river, deposited,
and mud.

o The words loan and investor show a strong co-occurrence relation with
the deposited hub, representing the financial sense.

 Result: The algorithm assigns the financial institution sense to "bank" in this
sentence because its context aligns with the deposited hub cluster.
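
A hedged sketch of this matching step, continuing from the hubs found above. It scores each hub by plain word overlap with the context; Véronis's full algorithm instead scores context words along a minimum spanning tree linking the hubs to the target word:

```python
def disambiguate(sentence, hubs):
    """Return the hub whose cluster overlaps the context words most."""
    context = set(tokenize(sentence)) - STOP
    return max(hubs, key=lambda h: len(({h} | hubs[h]) & context))

print(disambiguate("The investor secured a loan from the bank.", hubs))
# -> 'company' in this toy graph: the hub standing in for the financial sense
```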

Step 5: Interpretation (manual)


Since HyperLex operates without a dictionary, a human would need to inspect the
discovered "senses" and their corresponding hubs to understand their meaning.

 Sense A (Hub: river): This cluster of words (river, boats, sail, water)
corresponds to the "river bank" sense.

 Sense B (Hub: deposited): This cluster (deposited, savings, loan, investment)
corresponds to the "financial institution" sense.

 Sense C (Hub: mud): This cluster (mud, birds, nested, flood) corresponds to
the less frequent "mud bank" or "sand bank" sense. This demonstrates HyperLex's
ability to find infrequent senses.
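
In code, this inspection amounts to printing each hub with its cluster (continuing from the hubs dictionary above) so a human can map the clusters onto dictionary senses:

```python
for hub, cluster in sorted(hubs.items()):
    print(f"{hub}: {sorted(cluster)}")
```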
