Similarity Metrics of Strings - Python
Last Updated: 17 Jan, 2025
In Python, we often need to measure how similar two strings are. For example, given the strings "geeks" and "geeky", we might want to know how closely they match, whether for comparing user inputs or finding duplicate entries. Let's explore different methods to compute string similarity.
Using SequenceMatcher() from difflib
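The SequenceMatcher class in the difflib module provides a simple way to measure string similarity. Its ratio() method returns 2.0 * M / T, where M is the number of matching characters found and T is the total number of characters in both strings.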
Python
from difflib import SequenceMatcher
s1 = "geeks"
s2 = "geeky"
# Calculating similarity ratio
res = SequenceMatcher(None, s1, s2).ratio()
print(res)
Explanation:
- SequenceMatcher() compares two strings and calculates the ratio of matching characters.
- The ratio method returns a float between 0 and 1, indicating how similar the strings are.
- This method is simple to use and works well for general string similarity tasks.
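To make the ratio less of a black box, here is a minimal sketch that recomputes it from the matching blocks that SequenceMatcher reports. The helper name manual_ratio is hypothetical and exists only for illustration; get_matching_blocks() is part of difflib.

```python
from difflib import SequenceMatcher


def manual_ratio(a: str, b: str) -> float:
    """Recompute the similarity ratio as 2 * M / T (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b)
    # Each block is a named tuple (a, b, size); the final block is a
    # zero-length sentinel, so it does not change the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty strings are treated as identical, matching ratio().
    return 2.0 * matched / total if total else 1.0


print(manual_ratio("geeks", "geeky"))  # 0.8
```

For "geeks" and "geeky" the only matching block is "geek" (4 characters), and the strings have 10 characters in total, so the ratio is 2 * 4 / 10 = 0.8.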
Let's look at some other ways to measure string similarity.
Using Levenshtein distance (edit distance)
Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into another.
Python
# Third-party package, not part of the standard library (e.g. pip install Levenshtein)
import Levenshtein
s1 = "geeks"
s2 = "geeky"
# Calculating similarity ratio
res = Levenshtein.ratio(s1, s2)
print(res)
Explanation:
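- Levenshtein.ratio() method returns a normalized similarity score between 0 and 1 derived from the edit operations between the two strings.
- Because it accounts for character order, it suits cases where one string is a slightly edited version of the other, such as typos.
- This method is widely used in text processing. The standard algorithm takes time proportional to the product of the two string lengths, so it is practical for short to moderate strings.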
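If installing a third-party package is not an option, the distance itself can be sketched in pure Python with the classic dynamic-programming approach. The function names below are hypothetical, and normalized_similarity uses one common normalization (dividing by the longer length), which is an assumption and not necessarily the formula Levenshtein.ratio() applies.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("geeks", "geeky"))  # 1
print(normalized_similarity("geeks", "geeky"))
```

Only two rows of the table are kept at a time, so memory grows with the length of one string while the running time still depends on the product of both lengths.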
Using Jaccard similarity
Jaccard similarity is the size of the intersection of two sets divided by the size of their union.
Python
s1 = "geeks"
s2 = "geeky"
# Converting strings to sets of characters
set1 = set(s1)
set2 = set(s2)
# Calculating Jaccard similarity
res = len(set1 & set2) / len(set1 | set2)
print(res)
Explanation:
- The strings are converted into sets of characters.
- The intersection and union of the sets are used to calculate the similarity ratio.
- This method is easy to implement, but it only looks at which characters occur. It ignores their order and how often they repeat, so anagrams score 1.0.
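The calculation above can be wrapped in a small function to show both the worked example and the order-insensitivity. This is a minimal sketch: the name jaccard_similarity is hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in each string."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty strings count as identical
    return len(set_a & set_b) / len(union)


# {g, e, k} are shared; the union {g, e, k, s, y} has 5 elements.
print(jaccard_similarity("geeks", "geeky"))    # 0.6
# Anagrams contain the same characters, so order makes no difference.
print(jaccard_similarity("listen", "silent"))  # 1.0
```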
Using Cosine similarity
Cosine similarity is the cosine of the angle between two vectors in a multidimensional space. Here, each string is represented as a vector of character counts.
Python
from collections import Counter
from math import sqrt
s1 = "geeks"
s2 = "geeky"
# Convert strings to character frequency vectors
vec1 = Counter(s1)
vec2 = Counter(s2)
# Calculating cosine similarity
dot_product = sum(vec1[ch] * vec2[ch] for ch in vec1)
magnitude1 = sqrt(sum(count ** 2 for count in vec1.values()))
magnitude2 = sqrt(sum(count ** 2 for count in vec2.values()))
res = dot_product / (magnitude1 * magnitude2)
print(res)
Explanation:
- The strings are represented as frequency vectors using the Counter class.
- The dot product and magnitudes of the vectors are used to compute the similarity.
- Unlike Jaccard similarity, this method takes into account how often each character occurs, but it still ignores character order.
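For "geeks" and "geeky" the arithmetic works out as follows: the shared characters contribute 1*1 (g) + 2*2 (e) + 1*1 (k) = 6 to the dot product, and each vector has magnitude sqrt(1 + 4 + 1 + 1) = sqrt(7), so the similarity is 6 / 7, roughly 0.857. The sketch below packages the same steps in a function; the name cosine_similarity and the choice to return 0.0 for an empty string are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between character-count vectors of two strings."""
    vec_a, vec_b = Counter(a), Counter(b)
    # Counter returns 0 for missing keys, so unshared characters add nothing.
    dot = sum(count * vec_b[ch] for ch, count in vec_a.items())
    norm_a = sqrt(sum(count * count for count in vec_a.values()))
    norm_b = sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty string is similar to nothing
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("geeks", "geeky"), 3))  # 0.857
```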
Using Hamming distance
Hamming distance measures the number of differing characters at corresponding positions in two strings of equal length.
Python
s1 = "geeks"
s2 = "geeky"
# Calculating Hamming distance
if len(s1) != len(s2):
    raise ValueError("Strings must be of equal length")
res = sum(c1 != c2 for c1, c2 in zip(s1, s2))
print(res)
Explanation:
- zip() function pairs characters from both strings for comparison.
- A generator expression counts differing characters.
- This method requires strings of equal length and is efficient for this specific task.
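Hamming distance is a count of differences, not a similarity score, so it is on a different scale from the ratios produced by the other methods. One way to make it comparable is to divide by the string length. The sketch below assumes that normalization; both function names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Strings must be of equal length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def hamming_similarity(a: str, b: str) -> float:
    """Assumed normalization: fraction of positions that match."""
    distance = hamming_distance(a, b)  # also validates the lengths
    if not a:
        return 1.0  # two empty strings are treated as identical
    return 1 - distance / len(a)


print(hamming_distance("geeks", "geeky"))    # 1
print(hamming_similarity("geeks", "geeky"))
```

Raising ValueError for unequal lengths keeps the return type consistent, so callers never have to check whether they received a number or an error message.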