
MODULE - II, Unit 3: Semantics and Word Embedding

Semantics; Vector Semantics: Words and Vectors; Measuring Similarity; Semantics with Dense Vectors; SVD and Latent Semantic Analysis; Embeddings from Prediction: Skip-gram and Continuous Bag of Words; Concept of Word Sense; Introduction to WordNet

Words and Vector in NLP


In Natural Language Processing (NLP), words and vectors are fundamental concepts for
representing and processing textual data. Here's an overview:

Words in NLP

1. Tokenization:
o The process of breaking down text into individual units called tokens, which can
be words, subwords, or characters.
o For example, "Natural Language Processing" might be tokenized into ["Natural",
"Language", "Processing"].
2. Normalization:
o Techniques to bring uniformity to text data, including lowercasing, removing
punctuation, and stemming/lemmatization.
o For instance, stemming reduces "Running" to "run", while lemmatization maps the irregular form "ran" to its lemma "run".
3. Stop Words:
o Common words like "and", "the", and "is" that are often removed during preprocessing because they carry little meaning on their own. (A short preprocessing sketch combining these three steps follows this list.)
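The three preprocessing steps above can be combined in a few lines of Python. The following is a minimal sketch using the NLTK library; it assumes nltk is installed and that the punkt and stopwords data packages have been downloaded, which is an assumption and not part of the notes.

```python
# Minimal preprocessing sketch with NLTK (assumes nltk is installed and
# the 'punkt' and 'stopwords' data have been downloaded via nltk.download).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is running on the sample text."

# 1. Tokenization: split the raw string into word tokens.
tokens = word_tokenize(text)

# 2. Normalization: keep alphabetic tokens, lowercase them, then stem them.
stemmer = PorterStemmer()
normalized = [stemmer.stem(tok.lower()) for tok in tokens if tok.isalpha()]

# 3. Stop-word removal: drop very common words such as "is", "on", "the".
stops = set(stopwords.words("english"))
filtered = [tok for tok in normalized if tok not in stops]

print(filtered)  # e.g. ['natur', 'languag', 'process', 'run', 'sampl', 'text']
```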

Vectors in NLP

Vectors are numerical representations of text data, allowing for mathematical and computational
manipulation. Key techniques include:

1. One-Hot Encoding:
o Represents words as vectors of 0s and 1s, with each dimension corresponding to a
word in the vocabulary.
o Limitations include very high dimensionality and the inability to express semantic similarity between words (every pair of distinct one-hot vectors is equally distant).
2. Word Embeddings:
o Dense vectors that capture semantic relationships between words.
o Techniques include:
 Word2Vec: Uses neural networks to learn embeddings based on the
context of words. Produces vectors where similar words have similar
embeddings.
 GloVe (Global Vectors for Word Representation): Combines local
context and global statistical information from a text corpus to produce
word embeddings.
 FastText: Similar to Word2Vec but considers subword information, making it effective for handling rare words and morphological variations.
3. Contextualized Word Embeddings:
o Dynamic embeddings that take into account the context of a word in a sentence.
o Techniques include:
 ELMo (Embeddings from Language Models): Generates context-
sensitive embeddings using deep bidirectional LSTM networks.
 BERT (Bidirectional Encoder Representations from Transformers):
Uses transformer architecture to produce embeddings that are context-
dependent, considering both left and right context.
4. Sentence and Document Embeddings:
o Representing entire sentences or documents as single vectors.
o Techniques include:
 Doc2Vec: An extension of Word2Vec that generates embeddings for
larger text units like sentences and paragraphs.
 Universal Sentence Encoder (USE): Uses transformers or deep
averaging networks to produce embeddings for sentences.
 BERT-based models: Variants of BERT fine-tuned for producing
sentence or document-level embeddings, like Sentence-BERT.
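As a concrete illustration of how static word embeddings are obtained in practice, here is a minimal sketch using the Word2Vec implementation in the gensim library. The toy corpus, vector size, and other hyperparameters are illustrative assumptions, not values from these notes; gensim 4.x is assumed to be installed.

```python
# Sketch: training static word embeddings with gensim's Word2Vec
# (toy corpus and hyperparameters are illustrative only).
from gensim.models import Word2Vec

corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "embeddings", "capture", "semantic", "similarity"],
    ["language", "models", "learn", "meaning", "from", "context"],
]

# sg=0 trains CBOW; sg=1 would train the Skip-gram variant instead.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

vector = model.wv["language"]                      # a 50-dimensional dense vector
print(model.wv.most_similar("language", topn=3))   # nearest words in embedding space
```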

Applications

 Text Classification: Assigning categories to text based on its content.


 Named Entity Recognition (NER): Identifying and classifying named entities such as people, organizations, and locations in text.
 Machine Translation: Translating text from one language to another.
 Sentiment Analysis: Determining the sentiment expressed in a piece of text.
 Question Answering: Building systems that can answer questions based on text input.

Understanding words and their vector representations is fundamental to building effective NLP models and applications.

Measuring Similarity in NLP

Measuring similarity in NLP is essential for various applications, such as information retrieval,
text classification, and clustering. Here are some common techniques and methods used to
measure similarity between text units:

Similarity Measures for Word Vectors

1. Cosine Similarity:
o Measures the cosine of the angle between two vectors.
o Formula: cosine_similarity(A, B) = (A · B) / (∥A∥ ∥B∥)
o Values range from -1 to 1, where 1 indicates vectors pointing in the same direction, 0 indicates orthogonal vectors, and -1 indicates vectors pointing in opposite directions.
2. Euclidean Distance:
o Measures the straight-line distance between two points in a vector space.
o Formula: euclidean_distance(A, B) = √( Σ_{i=1..n} (A_i − B_i)² )
o Smaller distances indicate higher similarity.
3. Manhattan Distance (L1 Norm):
o Measures the sum of the absolute differences between the coordinates of the
vectors.
o Formula: manhattan_distance(A, B) = Σ_{i=1..n} |A_i − B_i|
o Often used when individual coordinate differences should contribute linearly; unlike Euclidean distance, large differences are not squared.
4. Jaccard Similarity:
o Measures similarity between two sets by dividing the size of their intersection by
the size of their union.
o Formula: jaccard_similarity(A, B) = |A ∩ B| / |A ∪ B|
o Values range from 0 to 1, where 1 indicates identical sets.
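The four measures above are straightforward to compute directly. The following NumPy sketch is a minimal illustration with toy vectors and token sets (numpy is assumed to be installed).

```python
# Minimal NumPy sketch of the four similarity/distance measures listed above.
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])

cosine = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))  # 1.0: same direction
euclidean = np.linalg.norm(A - B)                            # straight-line distance
manhattan = np.sum(np.abs(A - B))                            # sum of absolute differences

# Jaccard similarity operates on sets, e.g. sets of tokens.
S1, S2 = {"the", "cat", "sat"}, {"the", "cat", "ran"}
jaccard = len(S1 & S2) / len(S1 | S2)                        # 2 / 4 = 0.5

print(cosine, euclidean, manhattan, jaccard)
```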

Similarity Measures for Sentences and Documents

1. Bag-of-Words (BoW):
o Represents text as a set of word counts.
o Similarity measures (e.g., cosine similarity) can be applied to BoW vectors.
2. TF-IDF (Term Frequency-Inverse Document Frequency):
o Adjusts the BoW model by weighting terms based on their importance in a
document relative to a corpus.
o Similarity measures (e.g., cosine similarity) can be applied to TF-IDF vectors.
3. Word Mover's Distance (WMD):
o Measures the distance between two documents by considering the minimum
cumulative distance required to move words from one document to another.
o Useful when using word embeddings.
4. Sentence Embeddings:
o Techniques like Universal Sentence Encoder (USE), Sentence-BERT, and
Doc2Vec generate embeddings for entire sentences or documents.
o Cosine similarity is commonly used to measure similarity between sentence
embeddings.
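For documents, a common pipeline is to build TF-IDF vectors and compare them with cosine similarity. The sketch below uses scikit-learn, which is assumed to be installed; the three documents are toy examples.

```python
# Sketch: document similarity via TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock markets fell sharply today.",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix, one row per document
sims = cosine_similarity(tfidf)                # 3 x 3 matrix of pairwise similarities

print(sims.round(2))  # the first two documents score higher with each other than with the third
```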

Contextual Similarity

1. Contextualized Word Embeddings:


o Models like BERT, ELMo, and GPT-3 produce embeddings that consider the
context of a word in a sentence.
o Cosine similarity can be used to compare these embeddings, providing a more
nuanced similarity measure.
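As a sketch of contextual similarity in practice, the sentence-transformers package (a Sentence-BERT implementation) can encode whole sentences and compare them with cosine similarity. Both the package and the all-MiniLM-L6-v2 checkpoint named below are assumptions used only for illustration.

```python
# Sketch: contextual sentence similarity with a Sentence-BERT model
# (assumes the sentence-transformers package and the named checkpoint are available).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "He deposited cash at the bank.",
    "She opened a savings account.",
    "They walked along the river bank.",
]

embeddings = model.encode(sentences)           # one dense vector per sentence
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities

print(scores)
```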

Applications of Similarity Measures

1. Information Retrieval:
o Finding documents or passages that are similar to a query.
o Example: Search engines.
2. Text Classification:
o Grouping texts into predefined categories based on their content.
o Example: Spam detection in emails.
3. Clustering:
o Grouping similar texts into clusters.
o Example: Topic modeling.
4. Plagiarism Detection:
o Identifying duplicate or highly similar content across documents.
5. Recommender Systems:
o Suggesting items (e.g., articles, products) similar to what a user has interacted
with.
6. Paraphrase Detection:
o Identifying whether two sentences or texts express the same idea.

These measures and techniques help in analyzing and processing textual data by quantifying the
similarity between different text units.

Semantics with dense vectors

Dense vectors in NLP, such as Word2Vec, GloVe, and BERT embeddings, encode semantic
meaning effectively. These embeddings capture relationships between words and context, aiding
tasks like text classification, machine translation, and question answering by understanding
nuanced language semantics. They're trained to represent words or sentences in a continuous
vector space, where similar vectors denote similar meanings. This approach reduces
dimensionality compared to one-hot encoding, enhancing computational efficiency while
supporting transfer learning across diverse NLP applications.

Dense vectors (embeddings) in NLP encode semantic meaning:

 Word Embeddings (Word2Vec, GloVe, FastText) capture relationships between words.


 Contextualized Embeddings (ELMo, BERT) provide context-aware representations.
 They enable accurate text classification, machine translation, and question answering by
understanding nuanced meanings.
 Applications include sentiment analysis, information retrieval, and named entity
recognition.
 Dense vectors reduce computational complexity compared to sparse representations like
one-hot encoding.
 Challenges include interpretability and addressing biases inherent in training data.
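A quick way to experiment with dense vectors is to load a small pretrained set, for example GloVe vectors through gensim's downloader. The sketch assumes gensim is installed and that the glove-wiki-gigaword-50 dataset name is available in gensim-data; treat both as assumptions.

```python
# Sketch: loading pretrained dense word vectors (GloVe) via gensim's downloader.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use

print(glove["king"][:5])                     # first 5 dimensions of a dense vector
print(glove.most_similar("king", topn=3))    # nearest neighbours in the vector space
print(glove.similarity("king", "queen"))     # cosine similarity between two words
```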

SVD and Latent Semantic Analysis

SVD (Singular Value Decomposition) and Latent Semantic Analysis (LSA) are techniques used in NLP to generate embeddings that capture latent semantic information from text data:

1. SVD (Singular Value Decomposition):


o SVD is a matrix factorization method that decomposes a matrix into singular
vectors and singular values.
o In NLP, it's applied to the term-document matrix to uncover latent semantic
relationships between terms and documents.
o Embeddings are derived from the reduced dimensions of the matrix, capturing
semantic associations based on co-occurrence patterns.
2. Latent Semantic Analysis (LSA):
o LSA is based on SVD and involves reducing the dimensions of the term-
document matrix to extract latent semantic relationships.
o It transforms words into a lower-dimensional space where semantically similar
words have closer embeddings.
o LSA embeddings are used in tasks such as information retrieval, text
classification, and synonym extraction.

Applications in NLP:

 Information Retrieval: SVD and LSA embeddings help in finding relevant documents
based on semantic similarity rather than exact keyword matching.
 Text Classification: Embeddings derived from SVD/LSA can enhance the classification
accuracy by capturing underlying semantic relationships between words.
 Semantic Search: These embeddings improve the effectiveness of search engines by
understanding the semantic meaning of queries and documents.
 Topic Modeling: SVD/LSA can identify underlying topics in a collection of documents
by clustering terms that are semantically related.

In summary, SVD and LSA provide methods for generating embeddings that capture latent
semantic information in NLP tasks, contributing to more effective and nuanced text analysis and
processing.
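In practice, LSA is commonly implemented as TF-IDF weighting followed by a truncated SVD of the term-document matrix. The sketch below uses scikit-learn's TruncatedSVD; the corpus and the number of latent dimensions are illustrative assumptions.

```python
# Sketch: Latent Semantic Analysis as TF-IDF followed by truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "The doctor prescribed a new medicine for the patient.",
    "The patient took the medicine the doctor prescribed.",
    "The football team won the match.",
    "The team scored a goal in the football match.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)  # term-document weights
lsa = TruncatedSVD(n_components=2, random_state=0)                 # keep 2 latent dimensions
doc_embeddings = lsa.fit_transform(tfidf)                          # dense document vectors

print(doc_embeddings.round(2))  # rows 0-1 (medical) should come out close, as should rows 2-3 (sports)
```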
Skip-gram and Continuous Bag of Words (CBOW)

In NLP, both Skip-gram and Continuous Bag of Words (CBOW) are algorithms used to train word embeddings, particularly within the Word2Vec framework. Here's a concise overview of each:

1. Continuous Bag of Words (CBOW):


o Objective: Predict a target word based on its context.
o Training: CBOW takes a context of surrounding words (e.g., "the cat sat on the")
to predict the target word ("mat").
o Advantages: Efficient training with small datasets, good for frequent words.
2. Skip-gram:
o Objective: Predict context words given a target word.
o Training: Skip-gram predicts the context words ("the", "cat", "on", "the") from
the target word ("sat").
o Advantages: Effective with larger datasets and capturing less frequent words.

Applications:

 Semantic Similarity: Both models learn embeddings where similar words have closer
vector representations, aiding in tasks like synonym detection.
 Analogical Reasoning: Models can compute relationships like "man" is to "woman" as
"king" is to "queen" through vector arithmetic.
 Natural Language Understanding: Used in various NLP tasks such as machine
translation, sentiment analysis, and named entity recognition.

Choosing Between Skip-gram and CBOW:

 CBOW is faster to train and generally performs better with frequent words and smaller
datasets.
 Skip-gram captures more nuanced relationships and is preferred for larger datasets with
diverse vocabulary.

Both models leverage the distributional hypothesis that words appearing in similar contexts tend
to have similar meanings, making them foundational for learning semantic representations in
NLP.
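The analogical-reasoning property mentioned above can be checked directly with pretrained vectors. The sketch below loads GloVe-style vectors through gensim's downloader and performs the "king - man + woman" arithmetic; the package and the glove-wiki-gigaword-100 dataset name are assumptions for illustration.

```python
# Sketch: analogy arithmetic with pretrained vectors loaded via gensim
# (the dataset name below is an assumption; any pretrained KeyedVectors work similarly).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# most_similar performs the vector arithmetic king - man + woman internally
# and returns the nearest remaining word in the embedding space.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically something like [('queen', 0.78...)]
```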

Concept of Word Sense

The concept of word sense in NLP is directly tied to word sense disambiguation (WSD). It refers to the different meanings a word can have depending on the context in which it is used.

Here's a breakdown:
 Word Sense: Each distinct meaning a word can have is considered a word sense. For
instance, "bat" can refer to a flying mammal or a wooden club used in sports. These are
two distinct word senses of "bat."
 Word Sense Disambiguation (WSD): This is the process of identifying the intended
meaning of a word in a specific context. NLP systems grapple with ambiguity because
many words have multiple meanings. WSD helps determine the correct sense based on
surrounding words and the overall context.

Examples of Word Senses:

 Bank: Financial institution (e.g., "I went to the bank to deposit a check.") or the edge of a
river (e.g., "We sat by the river bank and enjoyed the sunset.")
 Light: Not heavy (e.g., "This box is very light.") or illumination (e.g., "Turn off the light
before you leave.")

Why is Word Sense Disambiguation Important?

 Accuracy: Accurate WSD is crucial for various NLP tasks like machine translation,
question answering, and sentiment analysis. The wrong sense can lead to nonsensical
translations, irrelevant answers, and misinterpreting the sentiment of a text.
 Context Matters: Consider the sentence "The code was cracked." Here, "cracked" could mean physically broken or deciphered (as with a coded message). WSD helps determine the intended meaning from the surrounding context.

Challenges of Word Sense Disambiguation:

 Homonymy: Homonyms are words with the same spelling and pronunciation but
different meanings (e.g., "bat"). Distinguishing between them requires deep contextual
understanding.
 Polysemy: A single word can have multiple related meanings (e.g., "light"). WSD needs
to identify the most relevant sense based on the context.

Approaches to Word Sense Disambiguation:

 Supervised Learning: Training models on manually annotated data where each word
occurrence has its designated sense.
 Unsupervised Learning: Analyzing word co-occurrence patterns and clustering words
based on similar contexts.
 Knowledge-based Methods: Leveraging lexical resources such as WordNet, a lexical database that records word senses and semantic relationships between words (see the sketch after this list).
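A minimal knowledge-based example is the simplified Lesk algorithm shipped with NLTK, which chooses a WordNet sense whose dictionary gloss overlaps most with the context. The sketch assumes nltk is installed and the wordnet and punkt data packages have been downloaded.

```python
# Sketch: knowledge-based WSD with NLTK's simplified Lesk algorithm and WordNet.
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "He sat on the bank of the river and watched the water flow."

# lesk() returns the WordNet synset whose definition best overlaps the context.
sense = lesk(word_tokenize(sentence), "bank")
print(sense, "-", sense.definition())

# WordNet itself enumerates every sense (synset) of a word:
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())
```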

Overall, word sense is a fundamental concept in NLP, and WSD plays a critical role in ensuring
accurate interpretation and processing of natural language.
Unit 4: NLP Applications and Case Studies

Intelligent Work Processors: Machine Translation; User Interfaces; Man-Machine Interfaces: Natural Language Querying; Tutoring and Authoring Systems; Speech Recognition; Commercial Use of NLP: NLP in Customer Service, Sentiment Analysis, Emotion Mining, Handling Frauds and SMS, Bots, LSTM & BERT Models, Conversations
