
A Soft Introduction To NLP - Semantic Similarity Calculations Using Python - Medium

This document summarizes an article that introduces semantic similarity calculations using Python and natural language processing (NLP). It explains that semantic similarity measures how alike the meanings of two words or phrases are by converting them into vectors and using a similarity function. The document provides an example Python script that imports a pre-trained model and calculates similarity scores between 0-1 for some sample sentences to demonstrate how semantic similarity can be measured programmatically. It also notes that different models may produce varying results, so testing multiple models is important.

Uploaded by

Tridiv
Copyright
© All Rights Reserved

17/08/2023, 12:54 A Soft Introduction to NLP / Semantic Similarity Calculations Using Python | Medium


Semantic Similarity Calculations Using NLP and Python: A Soft Introduction
Tanner Overcash · Follow
5 min read · Mar 21


This article covers at a very high level what semantic similarity is and demonstrates a
quick example of how you can take advantage of open-source tools and pre-trained models
in your Python scripts. I hope you like the word ‘similarity’ because you’re about to read it
a thousand times.

Follow along with the code, available here.

https://medium.com/@tanner.overcash/semantic-similarity-calculations-using-nlp-and-python-a-soft-introduction-1f31df965e40

Similar Houses? Semantic Similarity? Get it? Photo by Maria Orlova: https://www.pexels.com/photo/similar-houses-in-highland-valley-covered-with-dense-forest-4946680/

Introduction
Semantic similarity is a task within Natural Language Processing (NLP), a branch of
Artificial Intelligence (AI), that produces a quantitative measure of how alike the
meanings of two words or phrases are. At a high level, this is done by converting
words, sentences, or phrases into vectors (mathematical representations of those
words, sentences, or phrases) using a process called sentence embedding. A
function of similarity (great post about different functions of similarity here by
Ashutosh Kumar) is then applied to these embeddings to produce that quantitative
measure of similarity.
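
To make the idea of a function of similarity concrete, here is a minimal sketch of cosine similarity in plain Python. The vectors are made-up, two-dimensional toy values (real sentence embeddings have hundreds of dimensions), and `cosine_similarity` is a name chosen for this illustration rather than something taken from a library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (illustrative values only).
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, perpendicular, opposite)
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1, so a higher score means the two embeddings point in more similar directions.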

This measure of similarity can reflect different aspects of the subjects you’re
comparing, so be sure you fully understand what your end goals are. For
example, some similarity models compare the lengths of words or phrases to
determine a similarity measure. Others work best for comparing words in specific
lexicons, such as medical texts. The language, context, and content of what you’re
trying to compare are all factors to take into account when choosing a tool or
model.

Environment Setup


To start using semantic similarity with Python, we’re going to use the
sentence-transformers library, which is a framework for state-of-the-art sentence,
text, and image embeddings. One of the reasons I’m pointing this framework out is
the range of different models it supports. This support is important since we won’t
be training a model ourselves and will need to iterate to find which model best suits
our needs.

Let’s get started! First, we need to install the library. I’m on an Apple Silicon
MacBook, so I needed a few prerequisites before setting up my virtual environment.
If you’re not using a Mac, you can skip ahead to creating the virtual environment.

First, we’ll need to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Next, we need to install cmake. I recommend using Homebrew to make this simple:

brew install cmake

Next, let’s set up a virtual environment. I like to use Pyenv (the virtualenv
commands below come from its pyenv-virtualenv plugin). Per the sentence-transformers
installation notes, if you plan on using your GPU, you’ll need to install PyTorch
with the matching CUDA version. I won’t cover that here, but let me know if you’d be
interested in an article on that!

pyenv install 3.11.1
pyenv virtualenv 3.11.1 learning_nlp
pyenv activate learning_nlp
pip install -U sentence-transformers
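
If you want to confirm the install landed in the active environment, a small check like the one below can help. It only uses Python’s standard library; `installed_version` is a hypothetical helper name for this sketch.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version("sentence-transformers"))
```

Seeing `None` usually means pip installed into a different environment than the one your script is running in.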

Example Script
Now that we have the library installed, let’s look at how we can use it to
compare two sentences using a simple script.


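# NOTE: sentence_similarity is a separate package from sentence-transformers;
# this script assumes it is installed as well.
from sentence_similarity import sentence_similarity


# Several lines of this listing were cut off in the source. The defaults below
# are completed from the accompanying text; "cosine" is the assumed metric name.
def compare_sentences(
    sentence_1: str,
    sentence_2: str,
    model_name: str,
    embedding_type: str = "cls_token_embedding",
    metric: str = "cosine",
) -> str:
    """Utilizes an NLP model that calculates the similarity between
    two sentences or phrases."""
    # Load the pre-trained model with the chosen embedding type.
    model = sentence_similarity(model_name=model_name, embedding_type=embedding_type)
    # Score the two sentences with the chosen function of similarity.
    score = model.get_score(sentence_1, sentence_2, metric=metric)
    return f"Comparison Score between '{sentence_1}' and '{sentence_2}': {score}"


model_1 = "sentence-transformers/all-MiniLM-L6-v2"

sentence_1 = "rivers woods and hills"
sentence_2 = "streams forests and mountains"
sentence_3 = "deserts sand and shrubs"

print(compare_sentences(sentence_1=sentence_1, sentence_2=sentence_2, model_name=model_1))
print(compare_sentences(sentence_1=sentence_1, sentence_2=sentence_3, model_name=model_1))
print(compare_sentences(sentence_1=sentence_2, sentence_2=sentence_3, model_name=model_1))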

Here, we’ve created a function that requires two ‘sentences’ and a model name. The
function loads the model that creates the sentence embeddings (the variable named
model) and calculates the similarity between the two sentences (the variable named
score). I’ve predefined two variables in the function: embedding type, which is the
methodology being used to create the sentence embeddings, and the metric, or the
function of similarity. For our embedding type, I selected cls_token_embedding,
with cls standing for classification. This tells the model to produce one embedding
for each whole sentence rather than one for each individual word. The metric we’re
using is cosine similarity, one of several measures.
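
As a rough illustration of what an embedding type decides, the toy sketch below contrasts two common ways of collapsing per-token vectors into one sentence vector. It assumes the usual BERT-style convention in which the special classification token sits first in the sequence; the numbers are invented, and `cls_pooling` and `mean_pooling` are names chosen for this example, not functions from the library used above.

```python
from typing import List

Vector = List[float]


def cls_pooling(token_vectors: List[Vector]) -> Vector:
    """Use the first token's vector (the classification token) as the sentence vector."""
    return list(token_vectors[0])


def mean_pooling(token_vectors: List[Vector]) -> Vector:
    """Average every token's vector, dimension by dimension."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


# Invented 2-dimensional vectors for: classification token, "rivers", "woods".
toy_tokens = [[1.0, 0.0], [2.0, 4.0], [3.0, 2.0]]

print(cls_pooling(toy_tokens))   # [1.0, 0.0]
print(mean_pooling(toy_tokens))  # [2.0, 2.0]
```

Either way the result is a single vector per sentence, which is what the function of similarity then compares.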

The score returned by the function will be a number (specifically a float). Cosine
similarity can in principle range from -1 to 1, but the scores here all fall between
0 and 1. You can see the different scores for each pair of sentences. I created two
sentences that are semantically similar and one that is semantically different but
contextually similar, and I did this on purpose to further illustrate the importance
of knowing what you’re attempting to measure. Each sentence describes features
within a biome, but only sentences one and two describe similar biomes. This
similarity is what we’re trying to measure, and the scores we get back line up with
our desired output:


Comparison Score between 'rivers woods and hills' and 'streams forests and mountains': 0.84
Comparison Score between 'rivers woods and hills' and 'deserts sand and shrubs': 0.631
Comparison Score between 'streams forests and mountains' and 'deserts sand and shrubs': 0.576

Our function worked!… at least, it worked relative to the sentences we put into it.
Whether 0.84 crosses your threshold for ‘similar enough’ to count as a match depends
on what you’re trying to accomplish. This is where having multiple models to test
comes into play. Running the same comparisons with several models gives:

Now using model: 'sentence-transformers/all-MiniLM-L6-v2'
Comparison Score between 'rivers woods and hills' and 'streams forests and mountains': 0.84
Comparison Score between 'rivers woods and hills' and 'deserts sand and shrubs': 0.631
Comparison Score between 'streams forests and mountains' and 'deserts sand and shrubs': 0.576

Now using model: 'sentence-transformers/all-mpnet-base-v2'
Comparison Score between 'rivers woods and hills' and 'streams forests and mountains': 0.781
Comparison Score between 'rivers woods and hills' and 'deserts sand and shrubs': 0.501
Comparison Score between 'streams forests and mountains' and 'deserts sand and shrubs': 0.572

Now using model: 'sentence-transformers/paraphrase-MiniLM-L12-v2'
Comparison Score between 'rivers woods and hills' and 'streams forests and mountains': 0.922
Comparison Score between 'rivers woods and hills' and 'deserts sand and shrubs': 0.764
Comparison Score between 'streams forests and mountains' and 'deserts sand and shrubs': 0.722

Now using model: 'sentence-transformers/multi-qa-MiniLM-L6-cos-v1'
Comparison Score between 'rivers woods and hills' and 'streams forests and mountains': 0.805
Comparison Score between 'rivers woods and hills' and 'deserts sand and shrubs': 0.662
Comparison Score between 'streams forests and mountains' and 'deserts sand and shrubs': 0.613

Now using model: 'sentence-transformers/nli-mpnet-base-v2'
Comparison Score between 'rivers woods and hills' and 'streams forests and mountains': 0.823
Comparison Score between 'rivers woods and hills' and 'deserts sand and shrubs': 0.446
Comparison Score between 'streams forests and mountains' and 'deserts sand and shrubs': 0.398
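
One way to read these numbers side by side is to ask how far each model separates the similar pair from the other two pairs. The sketch below does that with the scores printed above; the 0.8 cut-off is a made-up threshold for illustration, and `margin` is a hypothetical helper name.

```python
# Scores copied from the output above, in the order:
# (similar pair, sentence 1 vs 3, sentence 2 vs 3).
scores = {
    "sentence-transformers/all-MiniLM-L6-v2": (0.84, 0.631, 0.576),
    "sentence-transformers/all-mpnet-base-v2": (0.781, 0.501, 0.572),
    "sentence-transformers/paraphrase-MiniLM-L12-v2": (0.922, 0.764, 0.722),
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1": (0.805, 0.662, 0.613),
    "sentence-transformers/nli-mpnet-base-v2": (0.823, 0.446, 0.398),
}


def margin(similar: float, other_a: float, other_b: float) -> float:
    """Gap between the similar pair and the closest dissimilar pair."""
    return round(similar - max(other_a, other_b), 3)


margins = {name: margin(*values) for name, values in scores.items()}
widest = max(margins, key=margins.get)

THRESHOLD = 0.8  # hypothetical cut-off for "similar enough"
passing = [
    name
    for name, (similar, other_a, other_b) in scores.items()
    if similar >= THRESHOLD and max(other_a, other_b) < THRESHOLD
]

for name, gap in margins.items():
    print(f"{name}: margin {gap}")
print("Widest margin:", widest)
print("Models that separate the pairs at the threshold:", passing)
```

With only three sentences this says very little about which model is better in general; it is just a compact way to compare the runs.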

As you can see, changing the model can substantially change the score you get back
after comparing the same sentences. By changing the model you’re using (and using a
MUCH larger sample size than three sentences), you can tune the function to work for
what you need. Or, you may learn you need to train a model!
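
If you want to try several models on your own sentence pairs, something like the sketch below could be a starting point. It calls the sentence-transformers library directly (`SentenceTransformer`, `encode`, and `util.cos_sim`) rather than the function from the example script, so its scores may not match the ones above exactly. `make_scorer` and `score_all` are hypothetical helper names, and the library import is deferred so the ranking helper can be tried with any scoring function.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Scorer = Callable[[str, str], float]


def make_scorer(model_name: str) -> Scorer:
    """Build a cosine-similarity scorer backed by a pre-trained model.

    The import is deferred until this function is called; the model is
    downloaded the first time it is used.
    """
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)

    def score(first: str, second: str) -> float:
        embeddings = model.encode([first, second], convert_to_tensor=True)
        return util.cos_sim(embeddings[0], embeddings[1]).item()

    return score


def score_all(pairs: List[Pair], scorer: Scorer) -> List[Tuple[Pair, float]]:
    """Score every pair and sort from most to least similar."""
    scored = [(pair, scorer(*pair)) for pair in pairs]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `score_all(pairs, make_scorer("sentence-transformers/all-MiniLM-L6-v2"))` and then repeating it with another model name gives you rankings you can compare across models on a larger sample.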

Summary
Through the colossal efforts of the open-source community, the barrier to entry for
working with NLP in Python isn’t as high as it may feel. The HuggingFace
community and sentence-similarity library offer a range of options for
quantitatively calculating the semantic similarity between words, sentences, or
phrases. By spending some time reviewing and testing different pretrained models,
you can implement semantic similarity into your applications and reap the benefits
of machine learning.
