An Embedding is Worth 1000 Words
Start using Word Embeddings in your Business!
Jaya Zenchenko
March 15, 2018
Women in Analytics
About Me
- BA Math
- MS Math Education
- MS Applied Math
- Research Scientist
- DoD: Images, Video, Graphs
- Data Scientist
- NLP, Unsupervised Learning
- Data Science Manager
- Images, Text
- Founder of “Women in Data Science - Austin” Meetup
@datanerd_jaya bijaya.zenchenko@gmail.com
Credit: http://briasimpson.com/wp-content/uploads/2013/04/Swamp-Overwhelm.jpg
Pop Quiz!
- What is an embedding and why do I care?
- Is an embedding really worth 1000 words?
- What does it mean for words to be related?
- What are some gotchas in dealing with text data?
- What approaches can I use to get insights from my text data?
Word Embeddings
- Coined in 2003
- Also called “Word Vectors”
- Text being converted into numbers
- Not just for words! Sentences, documents, etc.
- Not magic!
Why am I here?
- Word Embeddings are cool!
- No magic
- Highlight built-in functionality of the “gensim” and “scikit-learn” Python
packages to quickly tackle your next text-based project
What Can We Do?
- Get insights
- Search/Retrieval
- Clustering (grouping)
- Identify Topics (i.e. themes)
- Apply techniques to other data sets (images, click through, etc)!!
Definitions
Vector/Embedding
Vector or Embedding
2-dimensional vector = [2, 3]
3-dimensional vector = [2, 3, 1]
N-dimensional vector = [2, 3, 1, 5, …, nth value]
Credit: https://maths-wiki.wikispaces.com/Vectors
Clustering
Grouping points that are close - what is “close”?
Credit: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
Text Based Similarity
Credit: https://alexn.org/blog/2012/01/16/cosine-similarity-euclidean-distance.html
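A minimal sketch of cosine similarity, the measure most often used to compare text vectors (the numbers are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([2, 3, 1], [4, 6, 2]))  # 1.0 - same direction, different length
print(cosine_similarity([2, 3, 1], [1, 0, 0]))  # ~0.53 - much less similar
```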
Topics
Themes found in our data, e.g. “restaurants”, “amenities”, “activities”
Credit: https://nlpforhackers.io/topic-modeling/
Brief History of Word Embeddings
- 17th century - philosophers such as Leibniz and Descartes put forward
proposals for codes to relate words in different languages
- 1920s patent filed by Emanuel Goldberg
- “Statistical Machine that searched for documents stored on film”
Credit: http://museen-dresden.de/index.php?lang=de&node=termine&resartium=events&tempus=week&locus=technischesammlungen&event=2680
Resource: https://en.wikipedia.org/wiki/History_of_natural_language_processing
Brief History of Word Embeddings
- 1945 - Vannevar Bush - Inspired by Goldberg
- Desire for collective “memory” machine to make knowledge accessible
Credit: https://en.wikipedia.org/wiki/Vannevar_Bush
Themes In History - NLP
- Machine Translation
- Information Retrieval
- Language Understanding
“You shall know a word by the company it keeps — J. R. Firth (1957)”
Distributional hypothesis
“You shall know a word by the company it keeps — J. R. Firth (1957)”
Words that occur in similar contexts tend to have similar meanings
(Harris, 1954)
Similar Meaning
- Words that occur in similar contexts tend to have similar meanings
Type of “Similarity” | Definition | Examples
Semantic Relatedness | Any relation between words | Car, Road; Bee, Honey
Semantic Similarity | Words used in the same way | Car, Auto; Doctor, Nurse
Many more in Computational Linguistics!
Resource: https://arxiv.org/pdf/1003.1141.pdf
Context
- Words that occur in similar contexts tend to have similar meanings
- What is context?
- A word?
- A sentence?
- A whole document?
- A group (or window) of words?
- ...
Similar
- Words that occur in similar contexts tend to have similar meanings
- What does ‘similar context’ mean mathematically?
- Count words that appear together in context?
- Should we count words within the context?
- Count how far apart they are?
- Weight them?
- …
- Thinking about “context”, “similar”, and “meaning” in so many ways leads to
the evolution of different word embeddings
- Words that occur in similar contexts tend to have similar meanings
- What is a word??
- Words, when printed, are letters surrounded by white space and/or
end-of-sentence punctuation (. ? !)
- Words are combined to form sentences that follow language rules
- What is a sentence??
EASY!
Separate the text by ‘.’ , ‘!’, ‘?’
Dr. Ford did not ask Col. Mustard the name of Mr. Smith’s dog.
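A quick sketch of why the “easy” split fails on this sentence. The NLTK sentence tokenizer is used here only as an illustration (it is not part of the gensim/scikit-learn stack used in the rest of the talk, and needs nltk.download("punkt") once):

```python
import re
from nltk.tokenize import sent_tokenize  # pip install nltk; nltk.download("punkt")

text = "Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."

# Naive split on end-of-sentence punctuation:
print(re.split(r"[.!?]", text))
# ['Dr', ' Ford did not ask Col', ' Mustard the name of Mr', " Smith's dog", '']
# -> the abbreviations produce four bogus "sentences".

# A trained sentence tokenizer knows "Dr.", "Col.", "Mr." do not end sentences:
print(sent_tokenize(text))
# typically: ["Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."]
```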
Data Preprocessing - Clean and Tokenize
- Very important - your results may change drastically
- Tokenize - to split text into “tokens”
- Required for gensim
- To keep or not to keep:
- Numbers
- Punctuation
- Stop words (Common words - no universal list)
- Sparse words
- HTML tags
- ...
- Other languages may tokenize differently!
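A minimal clean-and-tokenize sketch with gensim's parsing.preprocessing (highlighted later in the appendix); the review string is made up:

```python
from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation,
    strip_multiple_whitespaces, remove_stopwords,
)

raw = "<p>We stayed 3 nights and LOVED the pool!</p>"

# Default filter chain: lowercase, strip tags/punctuation/numbers/short words,
# remove stop words, stem.
print(preprocess_string(raw))
# e.g. ['stai', 'night', 'love', 'pool']

# Or choose exactly which steps to keep - the "to keep or not to keep" list above.
my_filters = [lambda s: s.lower(), strip_tags, strip_punctuation,
              strip_multiple_whitespaces, remove_stopwords]
print(preprocess_string(raw, filters=my_filters))
# e.g. ['stayed', '3', 'nights', 'loved', 'pool']
```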
- “i made her duck”
Credit: https://emojipedia.org/duck/
https://design.tutsplus.com/tutorials/how-to-animate-a-character-throwing-a-ball--cms-26207
Named Entity Extraction - Annotate Example
- “i love my apple”
Credit: https://www.dabur.com/realfruitpower/fruit-juices/apple-fruit-history
https://www.apple.com/shop/product/MK0C2AM/A/apple-pencil-for-ipad-pro
Data Preprocessing - Annotate
- Annotate
- Part of speech (Noun, Verb, etc)
- Named entity recognition (Organization, Money, Time, Locations, etc.)
- Choose to append, keep or ignore
More at https://spacy.io/usage/linguistic-features#section-named-entities
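A minimal annotation sketch with spaCy (assumes the small English model is installed via `python -m spacy download en_core_web_sm`); the second sentence is made up to show entity labels:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Part-of-speech tags: decide whether to append, keep, or ignore tokens by role.
doc = nlp("i love my apple")
print([(tok.text, tok.pos_) for tok in doc])

# Named entities: ORG / GPE / MONEY / DATE etc.
doc2 = nlp("Apple opened a new store in Austin in March 2018 for $5 million.")
print([(ent.text, ent.label_) for ent in doc2.ents])
# e.g. [('Apple', 'ORG'), ('Austin', 'GPE'), ('March 2018', 'DATE'), ('$5 million', 'MONEY')]
# exact labels depend on the model version
```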
Data Preprocessing - Reduce Words
- Stemming
- Reduce word to “word stem”
- Crude heuristics to chop off word endings
- Many approaches - Porter Stemming in Gensim
- Fast!
- Running -> run
- Lemmatize
- Properly reduce the word based on part of speech annotation
- Available in gensim
- Slow
- Better -> good
- Both differ based on language!
Resource: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
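A small sketch of both reductions - Porter stemming from gensim as on the slide, plus NLTK's WordNet lemmatizer as one way to lemmatize (an assumption; spaCy lemmas work too, and NLTK needs nltk.download("wordnet") once):

```python
from gensim.parsing.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # 'run'
print(stemmer.stem("better"))   # 'better' - stemming is just a crude suffix chop

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' - uses the part of speech
```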
Preprocessing Example
- Remove stop words and stem
Credit: https://www.slideshare.net/mgrcar/text-and-text-stream-mining-tutorial-15137759
Embeddings
- Word
- Context (sentence, document, etc)
One-Hot-Encoding
- Convert words to vector - available in scikit-learn
quick 1 0 0 0 0
brown 0 1 0 0 0
dog 0 0 1 0 0
jump 0 0 0 1 0
lazy 0 0 0 0 1
Boolean Embedding
- Convert document to vector - 1 if the word exists, 0 otherwise
- Available in scikit-learn
1 1 1 1 1
Credit: https://www.slideshare.net/mgrcar/text-and-text-stream-mining-tutorial-15137759
Bag of Words
- “Term Frequency” - count the frequency of the word in a document
- “CountVectorizer” in scikit-learn
Document Term Matrix
- Term frequency for multiple documents
- “CountVectorizer” in scikit-learn
Credit: http://ryanheuser.org/word-vectors-2/
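A sketch of the boolean embedding and the document-term matrix with scikit-learn's CountVectorizer (toy documents made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox jumps over the lazy dog",
        "the lazy dog sleeps"]

# Boolean embedding: 1 if the word appears in the document, 0 otherwise.
X_bool = CountVectorizer(binary=True).fit_transform(docs)

# Bag of words / document-term matrix: raw term frequencies.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())  # vocabulary = columns of the matrix
print(X_counts.toarray())                 # rows = documents, entries = counts
# (older scikit-learn: get_feature_names() instead of get_feature_names_out())
```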
Weighting
- Why do we want to weight words?
- “The”, “and”, …
- Term Frequency Inverse Document Frequency (TFIDF)
- Reduce the weight of very common words that appear in many of the documents
- Applies to Document-Term Matrix
Credit: http://trimc-nlp.blogspot.com/2013/04/tfidf-with-google-n-grams-and-pos-tags.html
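A minimal TFIDF sketch in scikit-learn (toy documents made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the pool was great and the staff was great",
        "the room was clean and quiet",
        "great location near the park"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # rows = documents, columns = TFIDF-weighted terms

# "the" appears in every document, so it gets down-weighted relative to
# distinctive words like "pool", "clean", or "park".
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```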
Word Co-Occurrence Matrix
Word-Word Matrix for a given Context Window (# words to consider on each
side).
Word Co-Occurrence Matrix
Context Window Size = 1
Credit: https://towardsdatascience.com/word-to-vectors-natural-language-processing-b253dd0b0817
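A small sketch that builds co-occurrence counts for a context window of 1, matching the slide (plain Python, no library assumed):

```python
from collections import defaultdict

tokens = "the quick brown fox jumps over the lazy dog".split()
window = 1  # number of words to consider on each side

cooc = defaultdict(int)
for i, word in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            cooc[(word, tokens[j])] += 1

print(cooc[("quick", "brown")])  # 1 - adjacent, inside the window
print(cooc[("quick", "fox")])    # 0 - two words apart, outside a window of 1
```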
Weighting
- Why do we want to weight word pairs?
- “New” “York” vs “in” “the”
- Pointwise Mutual Information (PMI)
- Higher weight for mutually common words that are infrequent
Pointwise Mutual Information (PMI)
Credit: https://en.wikipedia.org/wiki/Pointwise_mutual_information
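The formula behind the slide is PMI(x, y) = log( p(x, y) / ( p(x) * p(y) ) ). A tiny sketch with made-up counts shows why “new york” scores high and “in the” does not:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    return math.log((count_xy / total) / ((count_x / total) * (count_y / total)))

# Made-up counts: "new" and "york" are individually rare but usually together...
print(pmi(count_xy=500, count_x=1_000, count_y=800, total=1_000_000))      # ~ 6.4
# ...while "in" and "the" co-occur often only because each is common everywhere.
print(pmi(count_xy=900, count_x=30_000, count_y=40_000, total=1_000_000))  # ~ -0.3
```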
Topic Embedding - Latent Spaces
One approach: Latent Semantic Analysis (LSA) - Singular Value Decomposition (SVD) on the
Document-Term Matrix or TFIDF Weighted Matrix
Credit: https://tex.stackexchange.com/questions/258811/diagram-for-svd
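A minimal LSA sketch - SVD on a TFIDF matrix - using scikit-learn's TruncatedSVD (gensim's models.lsimodel, listed in the appendix, is the other route); documents are made up:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great pool and friendly staff",
        "the pool area was great for kids",
        "easy parking near downtown",
        "parking garage close to downtown restaurants"]

X = TfidfVectorizer().fit_transform(docs)   # document-term TFIDF matrix

lsa = TruncatedSVD(n_components=2)          # keep 2 latent "topics"
doc_topics = lsa.fit_transform(X)           # each document as a 2-d topic vector
print(doc_topics.round(2))
# the two pool/staff documents and the two parking/downtown documents
# should land near each other in topic space
```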
Topic Embeddings
- LSA/LSI - Latent Semantic Analysis or Indexing
- Used for Search and Retrieval
- Can only capture linear relationships
- Use Non-Negative Matrix Factorization for “understandable” topics
- LDA (Latent Dirichlet Allocation)
- Can capture non-linear relationships
- Guided LDA (Semi-Supervised LDA)
- Seed the topics with a few words!
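A minimal LDA sketch with gensim (toy, pre-tokenized documents made up for illustration):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["pool", "swim", "kids", "pool", "slide"],
         ["parking", "garage", "downtown"],
         ["kids", "pool", "swim"],
         ["downtown", "restaurants", "parking"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words per document

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id, top_words in lda.print_topics():
    print(topic_id, top_words)  # top words per topic - these still need hand labels
```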
Prediction Based Embeddings
- Frequency Based -> Prediction Based
- Neural architecture to train
- Predict words based on other words (or characters!)
Neural Word Embedding - Word2Vec
vec(“king”) - vec(“man”) + vec(“woman”) = vec(“queen”)
Trained on Google News Data - 3 million words and phrases.
Credit: https://www.slideshare.net/dnldimitri/student-lecture-for-master-course-on-d-dimitrihermans
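A sketch of the analogy above using gensim's downloader to fetch the Google News vectors (the dataset name below is gensim's; the download is roughly 1.6 GB):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pretrained Google News vectors

print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ~0.71)]
```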
Popular Neural Word Embeddings by Mikolov
- Word2Vec (2013) - Better at semantic similarity
- brother : sister
- fastText - Better at syntactic similarity due to character n-grams
- great : greater
- Similar architecture for both - one trained on words, the other on
character n-grams
Resources: https://en.wikipedia.org/wiki/Word2vec
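A small sketch of the character n-gram difference: gensim's FastText can produce a vector even for a word it never saw. The tiny corpus and the typo are made up, and older gensim uses size/iter/vocab instead of the gensim 4.x names below:

```python
from gensim.models import FastText

sentences = [["the", "pool", "was", "great"],
             ["great", "pool", "for", "the", "kids"],
             ["greater", "austin", "area"]] * 50  # tiny toy corpus

ft = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

# Character n-grams mean even an unseen word (here a typo) still gets a vector:
print("poool" in ft.wv.key_to_index)        # False - never seen in training
print(ft.wv.most_similar("poool", topn=3))  # still answerable via shared n-grams
```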
Pretrained Embeddings
- Quickly leverage pretrained embeddings trained on different data sets
(google news, wikipedia, etc)
- Many available in baseline gensim package
- To gain deeper domain specific insights - train your own model !
- Additional models available to download - different languages, etc
https://github.com/Hironsan/awesome-embedding-models
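A sketch of discovering and loading the pretrained models that ship with gensim's downloader (model names vary by gensim version):

```python
import gensim.downloader as api

# See what pretrained embeddings gensim can fetch (GloVe, word2vec, fastText, ...).
print(sorted(api.info()["models"].keys()))

# Load a small one for quick experiments.
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("hotel", topn=5))
```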
Popular Neural Word Embedding
- GloVe (2014) - by Pennington, Socher, Manning
- Closest to a variant of word co-occurrence matrix
- Pre-trained model available in gensim
- Comparison of Word2Vec and GloVe by Radim Řehůřek (gensim author)
Resource: https://nlp.stanford.edu/projects/glove/
Breaking News!
- Omer Levy shows Word2Vec is basically SVD on a Pointwise Mutual
Information (PMI) weighted word co-occurrence matrix!
- Levy & Goldberg, NIPS 2014 - “Neural Word Embeddings as Implicit Matrix
Factorization”
- Chris Moody @ Stitch Fix - Oct 2017 - “Stop Using Word2Vec”
Credit: http://www.keepcalmandposters.com/poster/5718140_keep_calm_and_show_me_the_data
Why Word Embeddings?
- Too much text data!
- I want to know
Credit: https://blog.beeminder.com/allthethings/
Data & Packages
Credit: https://disneyworld.disney.go.com/entertainment/magic-kingdom/move-it-shake-it-dance-play-it
https://pragmaticarchitect.files.wordpress.com/2013/06/mabi87.png
Example of Data Preprocessing
Search Example
Code to Create Embeddings
Models Created
- Word2Vec - window size = 3, window size = 10
- fastText - window size = 3, window size = 10
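A sketch of how those four models could be trained with gensim; `reviews` stands in for the preprocessed, tokenized review sentences (not shown here), and older gensim uses size instead of vector_size:

```python
from gensim.models import Word2Vec, FastText

# Placeholder for the real preprocessed review sentences (lists of tokens).
reviews = [["great", "pool", "for", "kids"],
           ["easy", "parking", "near", "downtown"]] * 200

models = {}
for window in (3, 10):
    models[("word2vec", window)] = Word2Vec(reviews, vector_size=100,
                                            window=window, min_count=2, workers=4)
    models[("fasttext", window)] = FastText(reviews, vector_size=100,
                                            window=window, min_count=2, workers=4)

print(models[("fasttext", 3)].wv.most_similar("pool", topn=5))
```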
fastText FTW
* “most_similar” returns words ordered by similarity score
Credit: https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285
Where Can I Go?
Find me some grub!
Clustering
- Clustered the word embeddings using KMeans clustering
- Showing the word size in wordcloud based on word weight from TFIDF
- Results from Word2Vec - Window Size = 3
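A minimal sketch of the clustering step (a toy corpus stands in for the real reviews; sizing the wordcloud words by TFIDF weight is left out):

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [["great", "pool", "for", "kids"],
             ["easy", "parking", "near", "downtown"],
             ["kids", "loved", "the", "pool"]] * 100  # placeholder corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=2)

words = model.wv.index_to_key   # index2word in older gensim
vectors = model.wv.vectors      # one row per word, same order as `words`

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(vectors)

clusters = {}
for word, label in zip(words, km.labels_):
    clusters.setdefault(label, []).append(word)
for label, members in clusters.items():
    print(label, members)  # each cluster became one wordcloud in the talk
```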
What do people love?
What do people hate?
How to stock the house?
What activities?
Who do people travel with?
When do they travel?
Window Size Matters!
Credit: https://machinelearningmastery.com/what-are-word-embeddings/
Surprise me!
Dirty Data!
Pop Quiz!
- What does it mean for words to be related?
- Many things - semantic, syntactic, similar, related, etc
- What is an embedding and why do I care?
- Way to convert text data into numbers so we can use computers to do the work
- Is an embedding really worth 1000 words?
- It can be worth 10,000 words, depending on how big the entire data set is!
- What are some gotchas in dealing with text data?
- Preprocessing, window size, and other hyperparameters when using Word2Vec or fastText
- What approaches can I use to find insights in my data?
- Semantic Indexing, similar words, clustering, word2vec, fasttext
Thank you!
Questions?
Appendix
Embedding Pros and Cons
- Bag-of-Words
  - Pros: Simple, fast
  - Cons: Sparse, high dimension; does not capture position in text; does not capture semantics
- TFIDF
  - Pros: Easy to compute; easy to compare similarity between two documents
  - Cons: Sparse, high dimension; does not capture position in text; does not capture semantics
- Topic Space
  - Pros: Lower dimension; captures semantics; handles synonyms in documents; used for search/retrieval
  - Cons: Number of topics needs to be defined; could be slow in high dimension; topics need to be “hand labeled”
- Word2Vec
  - Pros: Can leverage pretrained models; understands relationships between words; better for analogies
  - Cons: Cannot handle unseen words; going from word vectors to sentence vectors is still active research
- fastText
  - Pros: Character based; can deal with unseen words; can leverage pretrained models
  - Cons: Longer to train than Word2Vec; going from word vectors to sentence vectors is still active research
Resources..
- https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html
- https://web.stanford.edu/class/cs124/
- https://datascience.stackexchange.com/questions/11402/preprocessing-text-before-use-rnn/11421
- https://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
- http://ruder.io/word-embeddings-1/index.html
- https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
- https://www.ischool.utexas.edu/~ssoy/organizing/l391d2c.htm
- https://en.wikipedia.org/wiki/Information_retrieval
- https://en.wikipedia.org/wiki/As_We_May_Think
- https://en.wikipedia.org/wiki/Latent_semantic_analysis
- https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e102131b.pdf - “A survey of text similarity”
- http://www.jair.org/media/2934/live-2934-4846-jair.pdf
More Resources...
- https://cs224d.stanford.edu/lecture_notes/notes1.pdf
- https://www.quora.com/What-are-the-advantages-and-disadvantages-of-TF-IDF
- https://simonpaarlberg.com/post/latent-semantic-analyses/
- https://www.quora.com/What-are-the-advantages-and-disadvantages-of-Latent-Semantic-Analysis
- http://elliottash.com/wp-content/uploads/2017/07/Text-class-05-word-embeddings-1.pdf
- http://ruder.io/secret-word2vec/index.html#addingcontextvectors
- https://www.linkedin.com/pulse/what-main-difference-between-word2vec-fasttext-federico-cesconi/
- https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
Highlight of Gensim Functions
- Gensim:
- parsing.preprocessing
- models.tfidfmodel
- models.lsimodel
- models.word2vec
- models.fasttext
- models.keyedvectors
- Descriptions at: https://radimrehurek.com/gensim/apiref.html
- Tutorials at: https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
Highlight of Other Python Functions
- Sklearn:
- TfidfVectorizer
- KMeans Clustering
- Incredible documentation overall : http://scikit-learn.org/stable/index.html
- Wordcloud:
- https://github.com/amueller/word_cloud/tree/master/wordcloud
Open Source Tool for Text Search
- Really Fast!!
- Built on Lucene - Apache Project - almost 20 years old and still evolving
- Lucene - set the standard for search and indexing
Example Code - Data Preprocessing
Example Code - TFIDF and Topic Embedding (LSA)
Semantic vs Syntactic
Credit: https://arxiv.org/pdf/1301.3781v3.pdf
Fun Reviews - Amazon