A Soft Introduction To NLP - Semantic Similarity Calculations Using Python - Medium
This article covers at a very high level what semantic similarity is and demonstrates a
quick example of how you can take advantage of open-source tools and pre-trained models
in your Python scripts. I hope you like the word ‘similarity’ because you’re about to read it
a thousand times.
https://medium.com/@tanner.overcash/semantic-similarity-calculations-using-nlp-and-python-a-soft-introduction-1f31df965e40 1/13
17/08/2023, 12:54 A Soft Introduction to NLP / Semantic Similarity Calculations Using Python | Medium
Similar Houses? Semantic Similarity? Get it? Photo by Maria Orlova: https://www.pexels.com/photo/similar-houses-in-highland-valley-covered-with-dense-forest-4946680/
Introduction
Semantic similarity is an area of Natural Language Processing (NLP), itself a branch of
Artificial Intelligence (AI), that produces a quantitative measure of how alike two
words or phrases are in meaning. At a high level, this is done by converting words,
sentences, or phrases into a vector (a mathematical representation of that word,
sentence, or phrase) using a process called sentence embedding. A function of
similarity (great post about different functions of similarity here by Ashutosh
Kumar) is then applied to these embeddings to produce that quantitative measure of
similarity.
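To make the idea of a function of similarity concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings (which usually have hundreds of dimensions), and the name cosine_similarity is only illustrative.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy "embeddings": invented numbers, not output from any model.
river = [0.9, 0.1, 0.3]
stream = [0.8, 0.2, 0.4]
desert = [0.1, 0.9, 0.2]

print(round(cosine_similarity(river, stream), 3))
print(round(cosine_similarity(river, desert), 3))
```

Vectors that point in nearly the same direction (river and stream here) score close to 1, while vectors that point in different directions score much lower.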
Environment Setup
To start using semantic similarity with Python, we’re going to use the sentence-
transformers library, which is a framework for state-of-the-art sentence, text, and
image embeddings. One of the reasons I’m pointing this framework out is the range of
different models it supports. This support is important since we won’t be training a
model ourselves and will need to iterate to find which model best suits our needs.
Let’s get started! First, we need to install the library. I’m on an Apple Silicon
MacBook, so I needed a few prerequisites before setting up my virtual environment.
If you’re not using a Mac, you can skip ahead to creating the virtual environment.
On a Mac, we first need to install cmake. I recommend using Homebrew to make this
simple: brew install cmake
Next, let’s set up a virtual environment. I like to use Pyenv. Note that, per the
sentence-transformers installation notes, if you plan on using your GPU, you’ll need
to install PyTorch with the matching CUDA version. I won’t cover that here, but let
me know if you’d be interested in an article on it!
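Once the environment is active, a quick sanity check can confirm the library is visible to the interpreter. This is a minimal sketch that assumes the package was installed with pip install sentence-transformers (which is imported as sentence_transformers). The helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("sentence_transformers"):
    print("sentence-transformers is available")
else:
    print("sentence-transformers not found; try: pip install sentence-transformers")
```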
Example Script
Now that we have the library installed, let’s look at how we can use it to compare
two sentences using a simple script.
model_1 = "sentence-transformers/all-MiniLM-L6-v2"
Here, we’ve created a function that requires two ‘sentences’ and a model name. The
function sets up the model that creates the sentence embeddings (the variable named
model) and calculates the similarity score between the two sentences (the variable
named score). I’ve predefined two variables in the function: the embedding type, which
is the methodology being used to create the sentence embeddings, and the metric, or
the function of similarity. For our embedding type, I selected cls_token_embedding,
with cls standing for classification. This tells the model to represent each sentence
as a whole, using its special classification token, rather than as individual words.
The metric we’re using is cosine similarity, one of several available measures.
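As a rough sketch of what such a function can look like, the version below calls sentence-transformers directly (SentenceTransformer and its encode method) and computes cosine similarity by hand. It is not the original script: the embedding type and metric settings described above are not reproduced, and the names compare_sentences, load_encoder, and the optional encoder argument are assumptions added so the function can be tried without downloading a model.

```python
import math

model_1 = "sentence-transformers/all-MiniLM-L6-v2"


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def load_encoder(model_name):
    """Load a pre-trained model and return its encode method."""
    # Imported lazily so the rest of this module works without the library.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode


def compare_sentences(sentence_1, sentence_2, model_name, encoder=None):
    """Embed both sentences and return their cosine similarity as a float.

    `encoder` is a hypothetical hook: any callable that maps a list of
    sentences to a list of vectors. When omitted, the named model is loaded.
    """
    encode = encoder if encoder is not None else load_encoder(model_name)
    embedding_1, embedding_2 = encode([sentence_1, sentence_2])
    return round(float(cosine(embedding_1, embedding_2)), 3)


# Example call (downloads the model the first time it runs):
# compare_sentences("rivers woods and hills", "streams forests and mountains", model_1)
```

Passing the model name as an argument keeps it easy to swap models later without touching the comparison logic.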
The score returned by the function is a float. Cosine similarity can in principle
range from -1 to 1, but in practice scores for sentence pairs like these usually fall
between 0 and 1, with higher meaning more similar. You can see the scores for each
pair below. I created three sentences: two that are semantically similar and one that
is semantically different but contextually similar. I did this on purpose to further
illustrate the importance of knowing what you’re attempting to measure here. Each
sentence describes features within a biome, but only sentences one and two describe
similar biomes. This similarity is what we’re trying to measure, and the scores we get
back line up with that expectation:
Comparison Score between 'rivers woods and hills' and 'streams forests and mountains': 0.84
Comparison Score between 'rivers woods and hills' and 'deserts sand and shrubs': 0.631
Comparison Score between 'streams forests and mountains' and 'deserts sand and shrubs': 0.576
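Output in this shape can be produced by scoring every pair of sentences once. The sketch below shows one way to do that with itertools.combinations. To keep it runnable without a model, it uses a deliberately naive placeholder scorer (word_overlap, a hypothetical helper) instead of an embedding model, so its numbers are not the scores shown above.

```python
from itertools import combinations


def word_overlap(sentence_1, sentence_2):
    """Placeholder scorer: share of distinct words the two sentences have in common."""
    words_1, words_2 = set(sentence_1.split()), set(sentence_2.split())
    if not words_1 | words_2:
        return 0.0
    return len(words_1 & words_2) / len(words_1 | words_2)


def compare_all(sentences, scorer):
    """Score every unordered pair once; return (sentence_1, sentence_2, score) tuples."""
    return [(a, b, round(scorer(a, b), 3)) for a, b in combinations(sentences, 2)]


sentences = [
    "rivers woods and hills",
    "streams forests and mountains",
    "deserts sand and shrubs",
]

for a, b, score in compare_all(sentences, word_overlap):
    print(f"Comparison Score between '{a}' and '{b}': {score}")
```

The placeholder gives all three pairs the same score (0.143), because the only word any two sentences share is 'and'. That is exactly the gap an embedding model fills: it can tell that rivers and streams are close in meaning even though the words differ.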
Our function worked!… at least, it worked relative to the sentences we put into it.
Whether 0.84 crosses your threshold for ‘similar enough’ to count as a match depends on
what you’re trying to accomplish. This is where having
multiple models to test comes into play. Using
As you can see, changing the model can drastically change the score you get back
after comparing the same sentences. By changing the model you’re using (and using a
MUCH larger sample size than three sentences), you can tune the function to work for
what you need. Or, you may learn you need to train a model!
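One way to make this tuning systematic is to hand-label a set of sentence pairs and measure how often a model plus a threshold agrees with the labels. The sketch below is an assumption about how that could be set up, not part of the original script. The function name accuracy_at_threshold and the labels are hypothetical, and the scorer simply replays the three scores reported earlier in this article so the example runs without a model.

```python
def accuracy_at_threshold(scorer, labelled_pairs, threshold):
    """Fraction of pairs where (score >= threshold) matches the human label."""
    correct = 0
    for sentence_1, sentence_2, is_similar in labelled_pairs:
        predicted = scorer(sentence_1, sentence_2) >= threshold
        correct += predicted == is_similar
    return correct / len(labelled_pairs)


# Hand-labelled examples: True means "these should count as similar".
labelled_pairs = [
    ("rivers woods and hills", "streams forests and mountains", True),
    ("rivers woods and hills", "deserts sand and shrubs", False),
    ("streams forests and mountains", "deserts sand and shrubs", False),
]

# Scores reported earlier in this article, replayed instead of recomputed.
reported_scores = {
    ("rivers woods and hills", "streams forests and mountains"): 0.84,
    ("rivers woods and hills", "deserts sand and shrubs"): 0.631,
    ("streams forests and mountains", "deserts sand and shrubs"): 0.576,
}


def replay_scorer(sentence_1, sentence_2):
    """Stand-in for a real model: look up the score reported in the article."""
    return reported_scores[(sentence_1, sentence_2)]


for threshold in (0.6, 0.7):
    accuracy = accuracy_at_threshold(replay_scorer, labelled_pairs, threshold)
    print(f"threshold={threshold}: accuracy={accuracy:.2f}")
```

With a threshold of 0.6 the desert pair scoring 0.631 is wrongly counted as similar, while 0.7 separates all three pairs correctly. Running the same check for each candidate model, on far more than three pairs, gives a like-for-like basis for choosing between them.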
Summary
Through the colossal efforts of the open-source community, the barrier to entry for
working with NLP in Python isn’t as high as it may feel. The HuggingFace
community and sentence-similarity library offer a range of options for
quantitatively calculating the semantic similarity between words, sentences, or
phrases. By spending some time reviewing and testing different pre-trained models,
you can implement semantic similarity into your applications and reap the benefits
of machine learning.
Geospatial Data and Python enthusiast. Probably somewhere in the Rocky Mountains right now!
Tanner Overcash