
MODULE - II, Unit 3: Semantics and Word Embedding

Semantics; Vector Semantics: Words and Vectors; Measuring Similarity; Semantics with Dense Vectors; SVD and Latent Semantic Analysis; Embeddings from Prediction: Skip-gram and Continuous Bag of Words; Concept of Word Sense; Introduction to WordNet

Words and Vector in NLP


In Natural Language Processing (NLP), words and vectors are fundamental concepts for
representing and processing textual data. Here's an overview:

Words in NLP

1. Tokenization:
o The process of breaking down text into individual units called tokens, which can
be words, subwords, or characters.
o For example, "Natural Language Processing" might be tokenized into ["Natural",
"Language", "Processing"].
2. Normalization:
o Techniques to bring uniformity to text data, including lowercasing, removing
punctuation, and stemming/lemmatization.
o For instance, stemming reduces "Running" to "run", while lemmatization maps the irregular form "ran" to its lemma "run".
3. Stop Words:
o Common words like "and", "the", and "is" that are often removed during preprocessing because they carry little meaning on their own. (A short preprocessing sketch combining these three steps follows this list.)
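The three preprocessing steps above can be combined in a few lines of Python. The following is a minimal sketch using the NLTK library; it assumes nltk is installed and that the punkt and stopwords data packages have been downloaded, which is an assumption and not part of the notes.

```python
# Minimal preprocessing sketch with NLTK (assumes nltk is installed and
# the 'punkt' and 'stopwords' data have been downloaded via nltk.download).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is running on the sample text."

# 1. Tokenization: split the raw string into word tokens.
tokens = word_tokenize(text)

# 2. Normalization: keep alphabetic tokens, lowercase them, then stem them.
stemmer = PorterStemmer()
normalized = [stemmer.stem(tok.lower()) for tok in tokens if tok.isalpha()]

# 3. Stop-word removal: drop very common words such as "is", "on", "the".
stops = set(stopwords.words("english"))
filtered = [tok for tok in normalized if tok not in stops]

print(filtered)  # e.g. ['natur', 'languag', 'process', 'run', 'sampl', 'text']
```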

Vectors in NLP

Vectors are numerical representations of text data, allowing for mathematical and computational
manipulation. Key techniques include:

1. One-Hot Encoding:
o Represents words as vectors of 0s and 1s, with each dimension corresponding to a
word in the vocabulary.
o Limitations include very high dimensionality and the inability to express semantic similarity between words (every pair of distinct one-hot vectors is equally distant).
2. Word Embeddings:
o Dense vectors that capture semantic relationships between words.
o Techniques include:
 Word2Vec: Uses neural networks to learn embeddings based on the
context of words. Produces vectors where similar words have similar
embeddings.
 GloVe (Global Vectors for Word Representation): Combines local
context and global statistical information from a text corpus to produce
word embeddings.
 FastText: Similar to Word2Vec but considers subword information, making it effective for handling rare words and morphological variations.
3. Contextualized Word Embeddings:
o Dynamic embeddings that take into account the context of a word in a sentence.
o Techniques include:
 ELMo (Embeddings from Language Models): Generates context-
sensitive embeddings using deep bidirectional LSTM networks.
 BERT (Bidirectional Encoder Representations from Transformers):
Uses transformer architecture to produce embeddings that are context-
dependent, considering both left and right context.
4. Sentence and Document Embeddings:
o Representing entire sentences or documents as single vectors.
o Techniques include:
 Doc2Vec: An extension of Word2Vec that generates embeddings for
larger text units like sentences and paragraphs.
 Universal Sentence Encoder (USE): Uses transformers or deep
averaging networks to produce embeddings for sentences.
 BERT-based models: Variants of BERT fine-tuned for producing
sentence or document-level embeddings, like Sentence-BERT.
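As a concrete illustration of how static word embeddings are obtained in practice, here is a minimal sketch using the Word2Vec implementation in the gensim library. The toy corpus, vector size, and other hyperparameters are illustrative assumptions, not values from these notes; gensim 4.x is assumed to be installed.

```python
# Sketch: training static word embeddings with gensim's Word2Vec
# (toy corpus and hyperparameters are illustrative only).
from gensim.models import Word2Vec

corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "embeddings", "capture", "semantic", "similarity"],
    ["language", "models", "learn", "meaning", "from", "context"],
]

# sg=0 trains CBOW; sg=1 would train the Skip-gram variant instead.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

vector = model.wv["language"]                      # a 50-dimensional dense vector
print(model.wv.most_similar("language", topn=3))   # nearest words in embedding space
```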

Applications

 Text Classification: Assigning categories to text based on its content.


 Named Entity Recognition (NER): Identifying and classifying named entities such as people, organizations, and locations in text.
 Machine Translation: Translating text from one language to another.
 Sentiment Analysis: Determining the sentiment expressed in a piece of text.
 Question Answering: Building systems that can answer questions based on text input.

Understanding words and their vector representations is fundamental to building effective NLP models and applications.

Measuring Similarity in NLP

Measuring similarity in NLP is essential for various applications, such as information retrieval,
text classification, and clustering. Here are some common techniques and methods used to
measure similarity between text units:

Similarity Measures for Word Vectors

1. Cosine Similarity:
o Measures the cosine of the angle between two vectors.
o Formula: cosine_similarity(A, B) = (A · B) / (∥A∥ ∥B∥)
o Values range from -1 to 1, where 1 indicates vectors pointing in the same direction, 0 indicates orthogonal vectors, and -1 indicates vectors pointing in opposite directions.
2. Euclidean Distance:
o Measures the straight-line distance between two points in a vector space.
o Formula: euclidean_distance(A, B) = √( Σ_{i=1..n} (A_i − B_i)² )
o Smaller distances indicate higher similarity.
3. Manhattan Distance (L1 Norm):
o Measures the sum of the absolute differences between the coordinates of the
vectors.
o Formula: manhattan_distance(A, B) = Σ_{i=1..n} |A_i − B_i|
o Often used when individual coordinate differences should contribute linearly; unlike Euclidean distance, large differences are not squared.
4. Jaccard Similarity:
o Measures similarity between two sets by dividing the size of their intersection by
the size of their union.
o Formula: jaccard_similarity(A, B) = |A ∩ B| / |A ∪ B|
o Values range from 0 to 1, where 1 indicates identical sets.
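The four measures above are straightforward to compute directly. The following NumPy sketch is a minimal illustration with toy vectors and token sets (numpy is assumed to be installed).

```python
# Minimal NumPy sketch of the four similarity/distance measures listed above.
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])

cosine = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))  # 1.0: same direction
euclidean = np.linalg.norm(A - B)                            # straight-line distance
manhattan = np.sum(np.abs(A - B))                            # sum of absolute differences

# Jaccard similarity operates on sets, e.g. sets of tokens.
S1, S2 = {"the", "cat", "sat"}, {"the", "cat", "ran"}
jaccard = len(S1 & S2) / len(S1 | S2)                        # 2 / 4 = 0.5

print(cosine, euclidean, manhattan, jaccard)
```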

Similarity Measures for Sentences and Documents

1. Bag-of-Words (BoW):
o Represents text as a set of word counts.
o Similarity measures (e.g., cosine similarity) can be applied to BoW vectors.
2. TF-IDF (Term Frequency-Inverse Document Frequency):
o Adjusts the BoW model by weighting terms based on their importance in a
document relative to a corpus.
o Similarity measures (e.g., cosine similarity) can be applied to TF-IDF vectors.
3. Word Mover's Distance (WMD):
o Measures the distance between two documents by considering the minimum
cumulative distance required to move words from one document to another.
o Useful when using word embeddings.
4. Sentence Embeddings:
o Techniques like Universal Sentence Encoder (USE), Sentence-BERT, and
Doc2Vec generate embeddings for entire sentences or documents.
o Cosine similarity is commonly used to measure similarity between sentence
embeddings.
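For documents, a common pipeline is to build TF-IDF vectors and compare them with cosine similarity. The sketch below uses scikit-learn, which is assumed to be installed; the three documents are toy examples.

```python
# Sketch: document similarity via TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock markets fell sharply today.",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix, one row per document
sims = cosine_similarity(tfidf)                # 3 x 3 matrix of pairwise similarities

print(sims.round(2))  # the first two documents score higher with each other than with the third
```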

Contextual Similarity

1. Contextualized Word Embeddings:


o Models like BERT, ELMo, and GPT-3 produce embeddings that consider the
context of a word in a sentence.
o Cosine similarity can be used to compare these embeddings, providing a more
nuanced similarity measure.
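As a sketch of contextual similarity in practice, the sentence-transformers package (a Sentence-BERT implementation) can encode whole sentences and compare them with cosine similarity. Both the package and the all-MiniLM-L6-v2 checkpoint named below are assumptions used only for illustration.

```python
# Sketch: contextual sentence similarity with a Sentence-BERT model
# (assumes the sentence-transformers package and the named checkpoint are available).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "He deposited cash at the bank.",
    "She opened a savings account.",
    "They walked along the river bank.",
]

embeddings = model.encode(sentences)           # one dense vector per sentence
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities

print(scores)
```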

Applications of Similarity Measures

1. Information Retrieval:
o Finding documents or passages that are similar to a query.
o Example: Search engines.
2. Text Classification:
o Grouping texts into predefined categories based on their content.
o Example: Spam detection in emails.
3. Clustering:
o Grouping similar texts into clusters.
o Example: Topic modeling.
4. Plagiarism Detection:
o Identifying duplicate or highly similar content across documents.
5. Recommender Systems:
o Suggesting items (e.g., articles, products) similar to what a user has interacted
with.
6. Paraphrase Detection:
o Identifying whether two sentences or texts express the same idea.

These measures and techniques help in analyzing and processing textual data by quantifying the
similarity between different text units.

Semantics with dense vectors

Dense vectors in NLP, such as Word2Vec, GloVe, and BERT embeddings, encode semantic
meaning effectively. These embeddings capture relationships between words and context, aiding
tasks like text classification, machine translation, and question answering by understanding
nuanced language semantics. They're trained to represent words or sentences in a continuous
vector space, where similar vectors denote similar meanings. This approach reduces
dimensionality compared to one-hot encoding, enhancing computational efficiency while
supporting transfer learning across diverse NLP applications.

Dense vectors (embeddings) in NLP encode semantic meaning:

 Word Embeddings (Word2Vec, GloVe, FastText) capture relationships between words.


 Contextualized Embeddings (ELMo, BERT) provide context-aware representations.
 They enable accurate text classification, machine translation, and question answering by
understanding nuanced meanings.
 Applications include sentiment analysis, information retrieval, and named entity
recognition.
 Dense vectors reduce computational complexity compared to sparse representations like
one-hot encoding.
 Challenges include interpretability and addressing biases inherent in training data.
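A quick way to experiment with dense vectors is to load a small pretrained set, for example GloVe vectors through gensim's downloader. The sketch assumes gensim is installed and that the glove-wiki-gigaword-50 dataset name is available in gensim-data; treat both as assumptions.

```python
# Sketch: loading pretrained dense word vectors (GloVe) via gensim's downloader.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use

print(glove["king"][:5])                     # first 5 dimensions of a dense vector
print(glove.most_similar("king", topn=3))    # nearest neighbours in the vector space
print(glove.similarity("king", "queen"))     # cosine similarity between two words
```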

SVD and Latent Semantic Analysis

SVD (Singular Value Decomposition) and Latent Semantic Analysis (LSA) are techniques used in NLP to generate embeddings that capture latent semantic information from text data:

1. SVD (Singular Value Decomposition):


o SVD is a matrix factorization method that decomposes a matrix into singular
vectors and singular values.
o In NLP, it's applied to the term-document matrix to uncover latent semantic
relationships between terms and documents.
o Embeddings are derived from the reduced dimensions of the matrix, capturing
semantic associations based on co-occurrence patterns.
2. Latent Semantic Analysis (LSA):
o LSA is based on SVD and involves reducing the dimensions of the term-
document matrix to extract latent semantic relationships.
o It transforms words into a lower-dimensional space where semantically similar
words have closer embeddings.
o LSA embeddings are used in tasks such as information retrieval, text
classification, and synonym extraction.

Applications in NLP:

 Information Retrieval: SVD and LSA embeddings help in finding relevant documents
based on semantic similarity rather than exact keyword matching.
 Text Classification: Embeddings derived from SVD/LSA can enhance the classification
accuracy by capturing underlying semantic relationships between words.
 Semantic Search: These embeddings improve the effectiveness of search engines by
understanding the semantic meaning of queries and documents.
 Topic Modeling: SVD/LSA can identify underlying topics in a collection of documents
by clustering terms that are semantically related.

In summary, SVD and LSA provide methods for generating embeddings that capture latent
semantic information in NLP tasks, contributing to more effective and nuanced text analysis and
processing.
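In practice, LSA is commonly implemented as TF-IDF weighting followed by a truncated SVD of the term-document matrix. The sketch below uses scikit-learn's TruncatedSVD; the corpus and the number of latent dimensions are illustrative assumptions.

```python
# Sketch: Latent Semantic Analysis as TF-IDF followed by truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "The doctor prescribed a new medicine for the patient.",
    "The patient took the medicine the doctor prescribed.",
    "The football team won the match.",
    "The team scored a goal in the football match.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)  # term-document weights
lsa = TruncatedSVD(n_components=2, random_state=0)                 # keep 2 latent dimensions
doc_embeddings = lsa.fit_transform(tfidf)                          # dense document vectors

print(doc_embeddings.round(2))  # rows 0-1 (medical) should come out close, as should rows 2-3 (sports)
```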
Skip-gram and Continuous Bag of Words (CBOW)

In NLP, both Skip-gram and Continuous Bag of Words (CBOW) are algorithms used to train word embeddings, particularly within the Word2Vec framework. Here's a concise overview of each:

1. Continuous Bag of Words (CBOW):


o Objective: Predict a target word based on its context.
o Training: CBOW takes a context of surrounding words (e.g., "the cat sat on the")
to predict the target word ("mat").
o Advantages: Efficient training with small datasets, good for frequent words.
2. Skip-gram:
o Objective: Predict context words given a target word.
o Training: Skip-gram predicts the context words ("the", "cat", "on", "the") from
the target word ("sat").
o Advantages: Effective with larger datasets and capturing less frequent words.

Applications:

 Semantic Similarity: Both models learn embeddings where similar words have closer
vector representations, aiding in tasks like synonym detection.
 Analogical Reasoning: Models can compute relationships like "man" is to "woman" as
"king" is to "queen" through vector arithmetic.
 Natural Language Understanding: Used in various NLP tasks such as machine
translation, sentiment analysis, and named entity recognition.

Choosing Between Skip-gram and CBOW:

 CBOW is faster to train and generally performs better with frequent words and smaller
datasets.
 Skip-gram captures more nuanced relationships and is preferred for larger datasets with
diverse vocabulary.

Both models leverage the distributional hypothesis that words appearing in similar contexts tend
to have similar meanings, making them foundational for learning semantic representations in
NLP.
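The analogical-reasoning property mentioned above can be checked directly with pretrained vectors. The sketch below loads GloVe-style vectors through gensim's downloader and performs the "king - man + woman" arithmetic; the package and the glove-wiki-gigaword-100 dataset name are assumptions for illustration.

```python
# Sketch: analogy arithmetic with pretrained vectors loaded via gensim
# (the dataset name below is an assumption; any pretrained KeyedVectors work similarly).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# most_similar performs the vector arithmetic king - man + woman internally
# and returns the nearest remaining word in the embedding space.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically something like [('queen', 0.78...)]
```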

Concept of Word Sense

The concept of word sense in NLP is directly tied to word sense disambiguation (WSD). It refers to the different meanings a word can have depending on the context in which it is used.

Here's a breakdown:
 Word Sense: Each distinct meaning a word can have is considered a word sense. For
instance, "bat" can refer to a flying mammal or a wooden club used in sports. These are
two distinct word senses of "bat."
 Word Sense Disambiguation (WSD): This is the process of identifying the intended
meaning of a word in a specific context. NLP systems grapple with ambiguity because
many words have multiple meanings. WSD helps determine the correct sense based on
surrounding words and the overall context.

Examples of Word Senses:

 Bank: Financial institution (e.g., "I went to the bank to deposit a check.") or the edge of a
river (e.g., "We sat by the river bank and enjoyed the sunset.")
 Light: Not heavy (e.g., "This box is very light.") or illumination (e.g., "Turn off the light
before you leave.")

Why is Word Sense Disambiguation Important?

 Accuracy: Accurate WSD is crucial for various NLP tasks like machine translation,
question answering, and sentiment analysis. The wrong sense can lead to nonsensical
translations, irrelevant answers, and misinterpreting the sentiment of a text.
 Context Matters: Consider the sentence "The code was cracked." Here, "cracked" could mean physically broken or deciphered (as with a coded message). WSD helps determine the intended meaning from the surrounding context.

Challenges of Word Sense Disambiguation:

 Homonymy: Homonyms are words with the same spelling and pronunciation but
different meanings (e.g., "bat"). Distinguishing between them requires deep contextual
understanding.
 Polysemy: A single word can have multiple related meanings (e.g., "light"). WSD needs
to identify the most relevant sense based on the context.

Approaches to Word Sense Disambiguation:

 Supervised Learning: Training models on manually annotated data where each word
occurrence has its designated sense.
 Unsupervised Learning: Analyzing word co-occurrence patterns and clustering words
based on similar contexts.
 Knowledge-based Methods: Leveraging lexical resources such as WordNet, a lexical database that records word senses and semantic relationships between words (see the sketch after this list).
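A minimal knowledge-based example is the simplified Lesk algorithm shipped with NLTK, which chooses a WordNet sense whose dictionary gloss overlaps most with the context. The sketch assumes nltk is installed and the wordnet and punkt data packages have been downloaded.

```python
# Sketch: knowledge-based WSD with NLTK's simplified Lesk algorithm and WordNet.
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "He sat on the bank of the river and watched the water flow."

# lesk() returns the WordNet synset whose definition best overlaps the context.
sense = lesk(word_tokenize(sentence), "bank")
print(sense, "-", sense.definition())

# WordNet itself enumerates every sense (synset) of a word:
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())
```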

Overall, word sense is a fundamental concept in NLP, and WSD plays a critical role in ensuring
accurate interpretation and processing of natural language.
Unit 4: NLP Applications and Case Studies

Intelligent Work Processors: Machine Translation; User Interfaces; Man-Machine Interfaces: Natural Language Querying; Tutoring and Authoring Systems; Speech Recognition; Commercial Use of NLP: NLP in Customer Service, Sentiment Analysis, Emotion Mining, Handling Frauds and SMS, Bots, LSTM & BERT Models, Conversations
