Distributional semantics

Rabindra Nath Nandi

Clustering maps of biomedical articles

Distributional semantics is a research area that develops and studies theories
and methods for quantifying and categorizing semantic similarities between
linguistic items based on their distributional properties in large samples of
language data.
Distributional hypothesis: linguistic items with similar distributions have similar
meanings.
The idea that "a word is characterized by the company it keeps" was popularized by Firth.

❖ Depends on Statistical Semantics
Statistical semantics applies the methods of statistics to the problem of
determining the meaning of words or phrases, ideally through unsupervised
learning, to a degree of precision at least sufficient for the purpose of information
retrieval.

Theory
Distributional semantics favors the use of linear algebra as a computational tool and
representational framework.
The basic approach is to collect distributional information in high-dimensional
vectors, and to define distributional/semantic similarity in terms of vector similarity.
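As a minimal sketch of this approach, the snippet below counts co-occurrences within a small window to build one vector per word, then compares words by cosine similarity. The function names, the window size of 2, and the toy corpus are assumptions made for illustration; they are not part of the original slides.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]
vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
```

On this toy corpus, "cat" and "dog" share most of their contexts, so their similarity comes out higher than that of "cat" and "milk".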

Basic Theory of NLP
Corpus: A large body of natural language text used for accumulating statistics on
natural language.
Lexicon: A collection of information about the words of a language, including the
lexical categories to which they belong. A lexicon is usually structured as a
collection of lexical entries, such as ("pig" N V ADJ). "pig" is familiar as a noun, but
it also occurs as a verb ("Jane pigged herself on pizza") and as an adjective, in the
phrase "pig iron".

Distributional semantic models (DSMs)
❖ The idea of using corpus-based statistics to extract information about the semantic
properties of words and other linguistic units is extremely common in
computational linguistics.
❖ Each of 1) the word-document distribution, 2) the topic-document distribution, and
3) the word distribution is a semantic space.
Use cases:
Information retrieval, document clustering, and quick understanding of documents.

Distributional Semantics Models
❖ Term frequency–inverse document frequency (tf-idf)
❖ Latent Semantic Analysis (LSA)
❖ Latent Dirichlet Allocation (LDA)
❖ Word embedding (word2vec)

❖ Term frequency–inverse document frequency (tf-idf)
Term frequency: The number of times a term occurs in a document.
Inverse document frequency: A factor that diminishes the weight of terms that occur
very frequently in the document set and increases the weight of terms that occur
rarely.

The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
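The weighting can be sketched in a few lines. This is only one common variant, assumed here for illustration: raw counts for tf and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (raw tf, idf = log(N / df))."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical sample documents, for illustration only.
docs = [
    "the cell divides",
    "the protein folds",
    "the cell expresses the protein",
]
weights = tf_idf(docs)
```

Under this variant, "the" occurs in every document and gets weight 0, while "divides", which occurs in only one document, gets a higher weight than "cell", which occurs in two.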

❖ Latent Semantic Analysis (LSA)
- Dimensionality reduction
- Finds latent relationships between words and documents
- Words and documents are sorted by their relationship

Dimensionality Reduction
Reduce the target-word-by-context matrix to a lower-dimensional matrix (a
matrix with fewer, linearly independent, columns/dimensions).
Two main reasons:
1) Smoothing: capture “latent dimensions” that generalize over sparser surface
dimensions (Singular Value Decomposition, or SVD).
2) Efficiency/space: sometimes the matrix is so large that you don’t even want to
construct it explicitly (Random Indexing).
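The second reason can be illustrated with a minimal Random Indexing sketch. Each context gets a fixed sparse random index vector, and a target word's reduced vector is the sum of the index vectors of the contexts it occurs with, so the full word-by-context matrix is never built. The dimensionality, the number of non-zero entries, the seed, and the sample pairs are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def random_index_vectors(cooccurrences, dim=8, nonzeros=2, seed=0):
    """Accumulate reduced vectors from an iterable of (target, context) pairs."""
    rng = random.Random(seed)
    index_vectors = {}                        # context -> sparse ternary vector
    reduced = defaultdict(lambda: [0] * dim)  # target -> accumulated vector

    for target, context in cooccurrences:
        if context not in index_vectors:
            # A new context gets a few randomly placed +1/-1 entries.
            vec = [0] * dim
            for pos in rng.sample(range(dim), nonzeros):
                vec[pos] = rng.choice((-1, 1))
            index_vectors[context] = vec
        row = reduced[target]
        for i, value in enumerate(index_vectors[context]):
            row[i] += value
    return dict(reduced)

# Hypothetical (target, context) observations, for illustration only.
pairs = [
    ("cat", "drinks"), ("cat", "chases"),
    ("dog", "drinks"), ("dog", "chases"),
    ("milk", "white"),
]
reduced = random_index_vectors(pairs)
```

Words that occur with the same contexts, such as "cat" and "dog" here, end up with identical reduced vectors, and every vector has `dim` entries regardless of how many distinct contexts exist.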


Ranking System Design:
Query => {terms}
Docs => {term-document, term-concept}
Query on Docs => find the relevant docs.
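A minimal sketch of this pipeline, assuming the simplest case: the query and each document are turned into raw term-count vectors (the term-document representation only, with no term-concept layer), and documents are ranked by cosine similarity to the query. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def to_vector(text):
    """Query or document => bag of terms with counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return indices of documents sharing terms with the query, best match first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(doc)), i) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0]

# Hypothetical document collection, for illustration only.
docs = [
    "gene expression in tumor cells",
    "protein folding dynamics",
    "tumor suppressor gene mutations",
]
ranking = rank("tumor gene", docs)
```

Both matching documents contain the two query terms, but the shorter one scores higher because cosine similarity normalizes by vector length. The document with no shared terms is dropped.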
