Assignment No 1 (Data Science) - Ashber
Std_ID: 20211-30975
ANSWER:
Hamming Distance and Edit Distance (Longest Common Subsequence-Based) are two different
measures used in computer science and information theory to quantify the dissimilarity or
similarity between two strings or sequences.
Hamming Distance:
Definition: Hamming Distance measures the difference between two strings of equal length by
counting the number of positions at which the corresponding symbols are different. It is only
defined for strings of equal length.
Formula: If we have two strings A and B of equal length n, the Hamming Distance (H) is
calculated as:
H(A, B) = number of positions i (1 ≤ i ≤ n) where A[i] ≠ B[i]
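As a minimal sketch of this position-by-position count, the Python function below compares two equal-length strings. The function name and the sample strings are illustrative assumptions, not part of the assignment.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Hamming Distance is only defined for equal-length inputs.
        raise ValueError("Hamming distance requires strings of equal length")
    # Each mismatching pair contributes 1 to the total.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_distance("1011101", "1001001"))  # 2
```

Raising an error on unequal lengths, rather than padding or truncating, keeps the function faithful to the definition.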
Edit Distance (LCS-Based):
Definition: Edit Distance, here tied to the Longest Common Subsequence (LCS), measures the
minimum number of operations (insertions, deletions, and substitutions) required to transform
one string into another.
Formula: Computing it involves constructing a matrix where each cell (i, j) represents the
minimum number of operations required to convert the first i characters of one string into the
first j characters of the other string. The value in the bottom-right cell of the matrix is the
edit distance.
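The matrix construction can be sketched in Python as follows. This is a minimal sketch that assumes the Levenshtein variant, where insertion, deletion, and substitution each cost 1; a strictly LCS-based variant would allow only insertions and deletions. The function name and test words are illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character edits turning s into t (Levenshtein)."""
    m, n = len(s), len(t)
    # dp[i][j] = edits needed to convert s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[m][n]  # bottom-right cell


print(edit_distance("kitten", "sitting"))  # 3
```

Unlike Hamming Distance, this works for strings of different lengths because insertions and deletions can shift the alignment.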
Use Cases:
Edit Distance is widely used in spell-checking, DNA sequence alignment, and text comparison,
such as in plagiarism detection and natural language processing tasks like machine translation.
Qno:4 What are the main reasons to use Hamming distance and Edit distance?
ANSWER:
Hamming Distance and Edit Distance serve distinct purposes and are used in different contexts.
Hamming Distance applies when comparing strings or sequences of equal length. Here are the
main reasons to use Hamming Distance:
1. Error Detection and Correction: Hamming Distance is fundamental in error detection and
correction codes, such as Hamming codes. It helps identify and correct errors in
transmitted or stored data.
where each bit represents a specific attribute or condition. It's commonly used in digital
3. String Matching: In cases where you have strings of equal length and you want to find
4. Genetics and DNA Analysis: Hamming Distance can be applied in genetics to measure
the difference between DNA sequences or genomes. It helps identify genetic mutations
and variations.
5. Hardware Testing: In hardware testing and circuit design, Hamming Distance is used to
On the other hand, Edit Distance (LCS-Based) is used when you want to measure the similarity
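or dissimilarity between two strings of varying lengths by considering insertions, deletions, and
substitutions. Here are the main reasons to use Edit Distance:
1. Text Comparison: Edit Distance is useful for comparing and aligning text documents. It's
valuable in tasks like spell-checking and plagiarism detection.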
2. Spell Checking: Edit Distance helps suggest corrections for misspelled words by finding
dictionary words within a small edit distance of the input.
4. OCR (Optical Character Recognition): Edit Distance can be employed in OCR systems to
5. Data Cleaning and Deduplication: Edit Distance can be used to clean and deduplicate
ANSWER:
Cosine Similarity is a metric used to measure the similarity between two vectors in a
multidimensional space, often in the context of information retrieval, text analysis, and
recommendation systems. It calculates the cosine of the angle between two vectors, which can
range from -1 to 1.
Concept:
Imagine you have two vectors, A and B, in a multi-dimensional space. Each dimension represents
a different attribute or feature, and the values in each dimension represent the magnitude of that
attribute.
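Cosine Similarity measures the cosine of the angle θ between these two vectors, which is
calculated as follows:
cos(θ) = (A · B) / (‖A‖ * ‖B‖)
where A · B is the dot product of the two vectors and ‖A‖ and ‖B‖ are their Euclidean lengths.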
Example:
Let's say you're working on a text analysis task where you want to compare the similarity
of two documents:
Document A: "The quick brown fox jumps over the lazy dog."
1. Vector Representation:
Convert each document into a vector where each dimension corresponds to the frequency of a
specific word. For simplicity, let's use a smaller vocabulary: {quick, brown, fox, jumps, over, the,
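A = [1, 1, 1, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0]
B = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 2, 1, 1]
2. Calculation:
A · B = (1 * 0) + (1 * 1) + (1 * 1) + (1 * 0) + (1 * 1) + (2 * 0) + (1 * 0) + (1 * 1) + (0 * 1) + (0 *
1) + (0 * 2) + (0 * 1) + (0 * 1) = 4
‖A‖ = sqrt((1^2) + (1^2) + (1^2) + (1^2) + (1^2) + (2^2) + (1^2) + (1^2) + (0^2) + (0^2) + (0^2)
+ (0^2) + (0^2)) = sqrt(11) ≈ 3.32
‖B‖ = sqrt((0^2) + (1^2) + (1^2) + (0^2) + (1^2) + (0^2) + (0^2) + (1^2) + (1^2) + (1^2) + (2^2) +
(1^2) + (1^2)) = sqrt(11) ≈ 3.32
Cosine Similarity = (A · B) / (‖A‖ * ‖B‖) = 4 / (sqrt(11) * sqrt(11)) = 4 / 11 ≈ 0.36
3. Interpretation:
The Cosine Similarity value of approximately 0.36 suggests that the two documents, A and B, are
somewhat similar but not highly similar. In general the value falls between -1 (vectors pointing
in opposite directions) and 1 (vectors pointing in the same direction). Because word counts are
never negative, for these vectors it falls between 0 (no words in common) and 1 (identical word
proportions).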
In this example, Cosine Similarity is used to compare the similarity between two documents
based on the frequency of words they share. Higher values of Cosine Similarity indicate greater
similarity, which is a useful measure for various applications like document retrieval,
text analysis, and recommendation systems.
Qno:2 Compute the distance between Row 1 and Row 2 of the following table by using the
Euclidean and Manhattan distance formulas.
         Feature 1   Feature 2   Feature 3
Row 1    10          3           3
Row 2    5           4           5
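As a hedged sketch of both computations on the two rows of the table, the Python below uses the standard definitions: Euclidean distance is the square root of the sum of squared differences, and Manhattan distance is the sum of absolute differences. The function and variable names are illustrative assumptions.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def manhattan_distance(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))


row1 = [10, 3, 3]
row2 = [5, 4, 5]

# Differences per feature: 5, -1, -2
print(round(euclidean_distance(row1, row2), 3))  # sqrt(25 + 1 + 4) = sqrt(30), prints 5.477
print(manhattan_distance(row1, row2))            # 5 + 1 + 2, prints 8
```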