An Embedding is Worth 1000 Words
Start using Word Embeddings in your Business!
Jaya Zenchenko
March 15, 2018
Women in Analytics
About Me
- BA Math
- MS Math Education
- MS Applied Math
- Research Scientist
- DoD: Images, Video, Graphs
- Data Scientist
- NLP, Unsupervised Learning
- Data Science Manager
- Images, Text
- Founder of “Women in Data Science - Austin” Meetup
@datanerd_jaya bijaya.zenchenko@gmail.com
Credit: http://briasimpson.com/wp-content/uploads/2013/04/Swamp-Overwhelm.jpg
Pop Quiz!
- What is an embedding and why do I care?
- Is an embedding really worth 1000 words?
- What does it mean for words to be related?
- What are some gotchas in dealing with text data?
- What approaches can I use to get insights from my text data?
Word Embeddings
- Coined in 2003
- Also called “Word Vectors”
- Text being converted into numbers
- Not just for words! Sentences, documents, etc.
- Not magic!
Why am I here?
- Word Embeddings are cool!
- No magic
- Highlight built-in functionality of the “gensim” and “scikit-learn” Python
packages to quickly tackle your next text-based project
What Can We Do?
- Get insights
- Search/Retrieval
- Clustering (grouping)
- Identify Topics (i.e. themes)
- Apply techniques to other data sets (images, click through, etc)!!
Definitions
Vector/Embedding
Vector or Embedding
2-dimensional vector = [2, 3]
3-dimensional vector = [2, 3, 1]
N-dimensional vector = [2, 3, 1, 5, …, nth value]
Credit: https://maths-wiki.wikispaces.com/Vectors
Clustering
Grouping points that are close - what is “close”?
Credit: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
Text Based Similarity
Credit: https://alexn.org/blog/2012/01/16/cosine-similarity-euclidean-distance.html
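A minimal sketch of cosine similarity, the measure most often used to compare text vectors (the numbers are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([2, 3, 1], [4, 6, 2]))  # 1.0 - same direction, different length
print(cosine_similarity([2, 3, 1], [1, 0, 0]))  # ~0.53 - much less similar
```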
Topics
Themes found in our data, e.g. “restaurants”, “amenities”, “activities”
Credit: https://nlpforhackers.io/topic-modeling/
Brief History of Word Embeddings
- 17th century - philosophers such as Leibniz and Descartes put forward
proposals for codes to relate words in different languages
- 1920s patent filed by Emanuel Goldberg
- “Statistical Machine that searched for documents stored on film”
Credit: http://museen-dresden.de/index.php?lang=de&node=termine&resartium=events&tempus=week&locus=technischesammlungen&event=2680
Resource: https://en.wikipedia.org/wiki/History_of_natural_language_processing
Brief History of Word Embeddings
- 1945 - Vannevar Bush - Inspired by Goldberg
- Desire for collective “memory” machine to make knowledge accessible
Credit: https://en.wikipedia.org/wiki/Vannevar_Bush
Themes In History - NLP
- Machine Translation
- Information Retrieval
- Language Understanding
“You shall know a word by the company it keeps — J. R. Firth (1957)”
Distributional hypothesis
“You shall know a word by the company it keeps — J. R. Firth (1957)”
Words that occur in similar contexts tend to have similar meanings
(Harris, 1954)
Similar Meaning
- Words that occur in similar contexts tend to have similar meanings
Type of “Similarity” | Definition | Examples
Semantic Relatedness | Any relation between words | Car, Road; Bee, Honey
Semantic Similarity | Words used in the same way | Car, Auto; Doctor, Nurse
Many more in Computational Linguistics!
Resource: https://arxiv.org/pdf/1003.1141.pdf
Context
- Words that occur in similar contexts tend to have similar meanings
- What is context?
- A word?
- A sentence?
- A whole document?
- A group (or window) of words?
- ...
Similar
- Words that occur in similar contexts tend to have similar meanings
- What does ‘similar context’ mean mathematically?
- Count words that appear together in context?
- Should we count words within the context?
- Count how far apart they are?
- Weight them?
- …
- Thinking about “context”, “similar”, and “meaning” in so many ways leads to
the evolution of different word embeddings
- Words that occur in similar contexts tend to have similar meanings
- What is a word??
- Words, when printed, are letters surrounded by white space and/or
end-of-sentence punctuation (. ? !)
- Words are combined to form sentences that follow language rules
- What is a sentence??
EASY!
Separate the text by ‘.’ , ‘!’, ‘?’
Dr. Ford did not ask Col. Mustard the name of Mr. Smith’s dog.
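A quick sketch of why the “easy” split fails on this sentence. The NLTK sentence tokenizer is used here only as an illustration (it is not part of the gensim/scikit-learn stack used in the rest of the talk, and needs nltk.download("punkt") once):

```python
import re
from nltk.tokenize import sent_tokenize  # pip install nltk; nltk.download("punkt")

text = "Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."

# Naive split on end-of-sentence punctuation:
print(re.split(r"[.!?]", text))
# ['Dr', ' Ford did not ask Col', ' Mustard the name of Mr', " Smith's dog", '']
# -> the abbreviations produce four bogus "sentences".

# A trained sentence tokenizer knows "Dr.", "Col.", "Mr." do not end sentences:
print(sent_tokenize(text))
# typically: ["Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."]
```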
Data Preprocessing - Clean and Tokenize
- Very important - your results may change drastically
- Tokenize - to split text into “tokens”
- Required for gensim
- To keep or not to keep:
- Numbers
- Punctuation
- Stop words (Common words - no universal list)
- Sparse words
- HTML tags
- ...
- Other languages may tokenize differently!
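A minimal clean-and-tokenize sketch with gensim's parsing.preprocessing (highlighted later in the appendix); the review string is made up:

```python
from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation,
    strip_multiple_whitespaces, remove_stopwords,
)

raw = "<p>We stayed 3 nights and LOVED the pool!</p>"

# Default filter chain: lowercase, strip tags/punctuation/numbers/short words,
# remove stop words, stem.
print(preprocess_string(raw))
# e.g. ['stai', 'night', 'love', 'pool']

# Or choose exactly which steps to keep - the "to keep or not to keep" list above.
my_filters = [lambda s: s.lower(), strip_tags, strip_punctuation,
              strip_multiple_whitespaces, remove_stopwords]
print(preprocess_string(raw, filters=my_filters))
# e.g. ['stayed', '3', 'nights', 'loved', 'pool']
```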
- “i made her duck”
Credit: https://emojipedia.org/duck/
https://design.tutsplus.com/tutorials/how-to-animate-a-character-throwing-a-ball--cms-26207
Named Entity Extraction - Annotate Example
- “i love my apple”
Credit: https://www.dabur.com/realfruitpower/fruit-juices/apple-fruit-history
https://www.apple.com/shop/product/MK0C2AM/A/apple-pencil-for-ipad-pro
Data Preprocessing - Annotate
- Annotate
- Part of speech (Noun, Verb, etc)
- Named entity recognition (Organization, Money, Time, Locations, etc.)
- Choose to append, keep or ignore
More at https://spacy.io/usage/linguistic-features#section-named-entities
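A minimal annotation sketch with spaCy (assumes the small English model is installed via `python -m spacy download en_core_web_sm`); the second sentence is made up to show entity labels:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Part-of-speech tags: decide whether to append, keep, or ignore tokens by role.
doc = nlp("i love my apple")
print([(tok.text, tok.pos_) for tok in doc])

# Named entities: ORG / GPE / MONEY / DATE etc.
doc2 = nlp("Apple opened a new store in Austin in March 2018 for $5 million.")
print([(ent.text, ent.label_) for ent in doc2.ents])
# e.g. [('Apple', 'ORG'), ('Austin', 'GPE'), ('March 2018', 'DATE'), ('$5 million', 'MONEY')]
# exact labels depend on the model version
```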
Data Preprocessing - Reduce Words
- Stemming
- Reduce word to “word stem”
- Crude heuristics to chop off word endings
- Many approaches - Porter Stemming in Gensim
- Fast!
- Running -> run
- Lemmatize
- Properly reduce the word based on part of speech annotation
- Available in gensim
- Slow
- Better -> good
- Both differ based on language!
Resource: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
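A small sketch of both reductions - Porter stemming from gensim as on the slide, plus NLTK's WordNet lemmatizer as one way to lemmatize (an assumption; spaCy lemmas work too, and NLTK needs nltk.download("wordnet") once):

```python
from gensim.parsing.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # 'run'
print(stemmer.stem("better"))   # 'better' - stemming is just a crude suffix chop

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' - uses the part of speech
```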
Preprocessing Example
- Remove stop words and stem
Credit: https://www.slideshare.net/mgrcar/text-and-text-stream-mining-tutorial-15137759
Embeddings
- Word
- Context (sentence, document, etc)
One-Hot-Encoding
- Convert words to vector - available in scikit-learn
quick 1 0 0 0 0
brown 0 1 0 0 0
dog 0 0 1 0 0
jump 0 0 0 1 0
lazy 0 0 0 0 1
Boolean Embedding
- Convert document to vector - 1 if the word exists, 0 otherwise
- Available in scikit-learn
1 1 1 1 1
Credit: https://www.slideshare.net/mgrcar/text-and-text-stream-mining-tutorial-15137759
Bag of Words
- “Term Frequency” - count the frequency of the word in a document
- “CountVectorizer” in scikit-learn
Document Term Matrix
- Term frequency for multiple documents
- “CountVectorizer” in scikit-learn
Credit: http://ryanheuser.org/word-vectors-2/
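A sketch of the boolean embedding and the document-term matrix with scikit-learn's CountVectorizer (toy documents made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox jumps over the lazy dog",
        "the lazy dog sleeps"]

# Boolean embedding: 1 if the word appears in the document, 0 otherwise.
X_bool = CountVectorizer(binary=True).fit_transform(docs)

# Bag of words / document-term matrix: raw term frequencies.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())  # vocabulary = columns of the matrix
print(X_counts.toarray())                 # rows = documents, entries = counts
# (older scikit-learn: get_feature_names() instead of get_feature_names_out())
```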
Weighting
- Why do we want to weight words?
- “The”, “and”, …
- Term Frequency Inverse Document Frequency (TFIDF)
- Reduce the weight of very common words that appear in many of the documents
- Applies to Document-Term Matrix
Credit: http://trimc-nlp.blogspot.com/2013/04/tfidf-with-google-n-grams-and-pos-tags.html
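A minimal TFIDF sketch in scikit-learn (toy documents made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the pool was great and the staff was great",
        "the room was clean and quiet",
        "great location near the park"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # rows = documents, columns = TFIDF-weighted terms

# "the" appears in every document, so it gets down-weighted relative to
# distinctive words like "pool", "clean", or "park".
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```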
Word Co-Occurrence Matrix
Word-Word Matrix for a given Context Window (# words to consider on each
side).
Word Co-Occurrence Matrix
Context Window Size = 1
Credit: https://towardsdatascience.com/word-to-vectors-natural-language-processing-b253dd0b0817
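A small sketch that builds co-occurrence counts for a context window of 1, matching the slide (plain Python, no library assumed):

```python
from collections import defaultdict

tokens = "the quick brown fox jumps over the lazy dog".split()
window = 1  # number of words to consider on each side

cooc = defaultdict(int)
for i, word in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            cooc[(word, tokens[j])] += 1

print(cooc[("quick", "brown")])  # 1 - adjacent, inside the window
print(cooc[("quick", "fox")])    # 0 - two words apart, outside a window of 1
```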
Weighting
- Why do we want to weight word pairs?
- “New” “York” vs “in” “the”
- Pointwise Mutual Information (PMI)
- Higher weight for mutually common words that are infrequent
Pointwise Mutual Information (PMI)
Credit: https://en.wikipedia.org/wiki/Pointwise_mutual_information
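The formula behind the slide is PMI(x, y) = log( p(x, y) / ( p(x) * p(y) ) ). A tiny sketch with made-up counts shows why “new york” scores high and “in the” does not:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    return math.log((count_xy / total) / ((count_x / total) * (count_y / total)))

# Made-up counts: "new" and "york" are individually rare but usually together...
print(pmi(count_xy=500, count_x=1_000, count_y=800, total=1_000_000))      # ~ 6.4
# ...while "in" and "the" co-occur often only because each is common everywhere.
print(pmi(count_xy=900, count_x=30_000, count_y=40_000, total=1_000_000))  # ~ -0.3
```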
Topic Embedding - Latent Spaces
One approach: Latent Semantic Analysis (LSA) - Singular Value Decomposition (SVD) on the
Document-Term Matrix or TFIDF Weighted Matrix
Credit: https://tex.stackexchange.com/questions/258811/diagram-for-svd
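A minimal LSA sketch - SVD on a TFIDF matrix - using scikit-learn's TruncatedSVD (gensim's models.lsimodel, listed in the appendix, is the other route); documents are made up:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great pool and friendly staff",
        "the pool area was great for kids",
        "easy parking near downtown",
        "parking garage close to downtown restaurants"]

X = TfidfVectorizer().fit_transform(docs)   # document-term TFIDF matrix

lsa = TruncatedSVD(n_components=2)          # keep 2 latent "topics"
doc_topics = lsa.fit_transform(X)           # each document as a 2-d topic vector
print(doc_topics.round(2))
# the two pool/staff documents and the two parking/downtown documents
# should land near each other in topic space
```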
Topic Embeddings
- LSA/LSI - Latent Semantic Analysis or Indexing
- Used for Search and Retrieval
- Can only capture linear relationships
- Use Non-Negative Matrix Factorization for “understandable” topics
- LDA (Latent Dirichlet Allocation)
- Can capture non-linear relationships
- Guided LDA (Semi-Supervised LDA)
- Seed the topics with a few words!
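A minimal LDA sketch with gensim (toy, pre-tokenized documents made up for illustration):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["pool", "swim", "kids", "pool", "slide"],
         ["parking", "garage", "downtown"],
         ["kids", "pool", "swim"],
         ["downtown", "restaurants", "parking"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words per document

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id, top_words in lda.print_topics():
    print(topic_id, top_words)  # top words per topic - these still need hand labels
```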
Prediction Based Embeddings
- Frequency Based -> Prediction Based
- Neural architecture to train
- Predict words based on other words (or characters!)
Neural Word Embedding - Word2Vec
vec(“king”) - vec(“man”) + vec(“woman”) = vec(“queen”)
Trained on Google News Data - 3 million words and phrases.
Credit: https://www.slideshare.net/dnldimitri/student-lecture-for-master-course-on-d-dimitrihermans
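A sketch of the analogy above using gensim's downloader to fetch the Google News vectors (the dataset name below is gensim's; the download is roughly 1.6 GB):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pretrained Google News vectors

print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ~0.71)]
```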
Popular Neural Word Embeddings by Mikolov
- Word2Vec (2013) - Better at semantic similarity
- brother : sister
- fastText - Better at syntactic similarity due to character n-grams
- great : greater
- Similar architecture for both - one trained on words, the other on
character n-grams
Resources: https://en.wikipedia.org/wiki/Word2vec
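A small sketch of the character n-gram difference: gensim's FastText can produce a vector even for a word it never saw. The tiny corpus and the typo are made up, and older gensim uses size/iter/vocab instead of the gensim 4.x names below:

```python
from gensim.models import FastText

sentences = [["the", "pool", "was", "great"],
             ["great", "pool", "for", "the", "kids"],
             ["greater", "austin", "area"]] * 50  # tiny toy corpus

ft = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

# Character n-grams mean even an unseen word (here a typo) still gets a vector:
print("poool" in ft.wv.key_to_index)        # False - never seen in training
print(ft.wv.most_similar("poool", topn=3))  # still answerable via shared n-grams
```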
Pretrained Embeddings
- Quickly leverage pretrained embeddings trained on different data sets
(google news, wikipedia, etc)
- Many available in baseline gensim package
- To gain deeper domain specific insights - train your own model !
- Additional models available to download - different languages, etc
https://github.com/Hironsan/awesome-embedding-models
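A sketch of discovering and loading the pretrained models that ship with gensim's downloader (model names vary by gensim version):

```python
import gensim.downloader as api

# See what pretrained embeddings gensim can fetch (GloVe, word2vec, fastText, ...).
print(sorted(api.info()["models"].keys()))

# Load a small one for quick experiments.
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("hotel", topn=5))
```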
Popular Neural Word Embedding
- GloVe (2014) - by Pennington, Socher, Manning
- Closest to a variant of word co-occurrence matrix
- Pre-trained model available in gensim
- Comparison of Word2Vec and GloVe by Radim Řehůřek (gensim author)
Resource: https://nlp.stanford.edu/projects/glove/
Breaking News!
- Omer Levy shows Word2Vec is basically SVD on a Pointwise Mutual
Information (PMI) weighted word co-occurrence matrix!
- Levy & Goldberg, NIPS 2014 - “Neural Word Embeddings as Implicit Matrix
Factorization”
- Chris Moody @ Stitch Fix - Oct 2017 - “Stop Using Word2Vec”
Credit: http://www.keepcalmandposters.com/poster/5718140_keep_calm_and_show_me_the_data
Why Word Embeddings?
- Too much text data!
- I want to know
Credit: https://blog.beeminder.com/allthethings/
Data & Packages
Credit: https://disneyworld.disney.go.com/entertainment/magic-kingdom/move-it-shake-it-dance-play-it
https://pragmaticarchitect.files.wordpress.com/2013/06/mabi87.png
Example of Data Preprocessing
Search Example
Code to Create Embeddings
Models Created
- Word2Vec - window size = 3, window size = 10
- fastText - window size = 3, window size = 10
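A sketch of how those four models could be trained with gensim; `reviews` stands in for the preprocessed, tokenized review sentences (not shown here), and older gensim uses size instead of vector_size:

```python
from gensim.models import Word2Vec, FastText

# Placeholder for the real preprocessed review sentences (lists of tokens).
reviews = [["great", "pool", "for", "kids"],
           ["easy", "parking", "near", "downtown"]] * 200

models = {}
for window in (3, 10):
    models[("word2vec", window)] = Word2Vec(reviews, vector_size=100,
                                            window=window, min_count=2, workers=4)
    models[("fasttext", window)] = FastText(reviews, vector_size=100,
                                            window=window, min_count=2, workers=4)

print(models[("fasttext", 3)].wv.most_similar("pool", topn=5))
```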
fastText FTW
* “most_similar” returns words ordered by similarity score
Credit: https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285
Where Can I Go?
Find me some grub!
Clustering
- Clustered the word embeddings using KMeans clustering
- Showing the word size in wordcloud based on word weight from TFIDF
- Results from Word2Vec - Window Size = 3
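A minimal sketch of the clustering step (a toy corpus stands in for the real reviews; sizing the wordcloud words by TFIDF weight is left out):

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [["great", "pool", "for", "kids"],
             ["easy", "parking", "near", "downtown"],
             ["kids", "loved", "the", "pool"]] * 100  # placeholder corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=2)

words = model.wv.index_to_key   # index2word in older gensim
vectors = model.wv.vectors      # one row per word, same order as `words`

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(vectors)

clusters = {}
for word, label in zip(words, km.labels_):
    clusters.setdefault(label, []).append(word)
for label, members in clusters.items():
    print(label, members)  # each cluster became one wordcloud in the talk
```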
What do people love?
What do people hate?
How to stock the house?
What activities?
Who do people travel with?
When do they travel?
Window Size Matters!
Credit: https://machinelearningmastery.com/what-are-word-embeddings/
Surprise me!
Dirty Data!
Pop Quiz!
- What does it mean for words to be related?
- Many things - semantic, syntactic, similar, related, etc
- What is an embedding and why do I care?
- Way to convert text data into numbers so we can use computers to do the work
- Is an embedding really worth 1000 words?
- It can be worth 10,000 words, depending on how big the entire data set is!
- What are some gotchas in dealing with text data?
- Preprocessing, window size, and other hyperparameters when using Word2Vec or fastText
- What approaches can I use to find insights in my data?
- Semantic Indexing, similar words, clustering, word2vec, fasttext
Thank you!
Questions?
Appendix
Embedding Pros and Cons
- Bag-of-Words
  - Pros: Simple, fast
  - Cons: Sparse, high dimension; does not capture position in text; does not capture semantics
- TFIDF
  - Pros: Easy to compute; easy to compare similarity between two documents
  - Cons: Sparse, high dimension; does not capture position in text; does not capture semantics
- Topic Space
  - Pros: Lower dimension; captures semantics; handles synonyms in documents; used for search/retrieval
  - Cons: Number of topics needs to be defined; could be slow in high dimension; topics need to be “hand labeled”
- Word2Vec
  - Pros: Can leverage pretrained models; understands relationships between words; better for analogies
  - Cons: Cannot handle unseen words; going from word vectors to sentence vectors is still active research
- fastText
  - Pros: Character based; can deal with unseen words; can leverage pretrained models
  - Cons: Longer to train than Word2Vec; going from word vectors to sentence vectors is still active research
Resources..
- https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html
- https://web.stanford.edu/class/cs124/
- https://datascience.stackexchange.com/questions/11402/preprocessing-text-before-use-rnn/11421
- https://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
- http://ruder.io/word-embeddings-1/index.html
- https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
- https://www.ischool.utexas.edu/~ssoy/organizing/l391d2c.htm
- https://en.wikipedia.org/wiki/Information_retrieval
- https://en.wikipedia.org/wiki/As_We_May_Think
- https://en.wikipedia.org/wiki/Latent_semantic_analysis
- https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e102131b.pdf - “A survey of text similarity”
- http://www.jair.org/media/2934/live-2934-4846-jair.pdf
More Resources...
- https://cs224d.stanford.edu/lecture_notes/notes1.pdf
- https://www.quora.com/What-are-the-advantages-and-disadvantages-of-TF-IDF
- https://simonpaarlberg.com/post/latent-semantic-analyses/
- https://www.quora.com/What-are-the-advantages-and-disadvantages-of-Latent-Semantic-Analysis
- http://elliottash.com/wp-content/uploads/2017/07/Text-class-05-word-embeddings-1.pdf
- http://ruder.io/secret-word2vec/index.html#addingcontextvectors
- https://www.linkedin.com/pulse/what-main-difference-between-word2vec-fasttext-federico-cesconi/
- https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
Highlight of Gensim Functions
- Gensim:
- parsing.preprocessing
- models.tfidfmodel
- models.lsimodel
- models.word2vec
- models.fasttext
- models.keyedvectors
- Descriptions at: https://radimrehurek.com/gensim/apiref.html
- Tutorials at: https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
Highlight of Other Python Functions
- Sklearn:
- TfidfVectorizer
- KMeans Clustering
- Incredible documentation overall : http://scikit-learn.org/stable/index.html
- Wordcloud:
- https://github.com/amueller/word_cloud/tree/master/wordcloud
Open Source Tool for Text Search
- Really Fast!!
- Built on Lucene - Apache Project - almost 20 years old and still evolving
- Lucene - set the standard for search and indexing
Example Code - Data Preprocessing
Example Code - TFIDF and Topic Embedding (LSA)
Semantic vs Syntactic
Credit: https://arxiv.org/pdf/1301.3781v3.pdf
Fun Reviews - Amazon