
Assignment No 1 (Data Science) - Ashber


Assignment No:1

Subject: Introduction to Data Science

Name: ASHBER KHAN

Std_ID: 20211-30975

Qno:3 Explain the Hamming distance and Edit distance (LCS-based).

ANSWER:

Hamming Distance and Edit Distance (Longest Common Subsequence-Based) are two different measures used in computer science and information theory to quantify the dissimilarity between two strings or sequences.

Hamming Distance:

Definition: Hamming Distance is defined only for two strings of equal length. It counts the number of positions at which the corresponding symbols differ.

Formula: For two strings A and B of equal length n, the Hamming Distance (H) is calculated as:

H(A, B) = Σᵢ₌₁ⁿ [A[i] ≠ B[i]]

where [A[i] ≠ B[i]] is 1 if the symbols at position i differ and 0 otherwise. For example, H("karolin", "kathrin") = 3, because the strings differ at positions 3, 4, and 5.
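
The formula can be sketched in a few lines of Python. This is only an illustration: the function name hamming_distance and the sample strings are assumptions chosen here, not part of the assignment.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Hamming distance is undefined for strings of different lengths.
        raise ValueError("Hamming distance requires strings of equal length")
    # Each comparison is True (1) where the symbols differ and False (0) otherwise.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("toned", "roses"))  # 3
print(hamming_distance("10110", "10011"))  # 2
```

Raising an error for unequal lengths mirrors the definition, rather than silently comparing only the overlapping part.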

Edit Distance (LCS-Based):


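Definition: Edit Distance measures the minimum number of single-character operations required to transform one string into another. The best-known variant, Levenshtein Distance, allows insertions, deletions, and substitutions. The LCS-based variant allows only insertions and deletions, so for two strings A and B it equals |A| + |B| − 2 · LCS(A, B), where LCS(A, B) is the length of their Longest Common Subsequence.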

Algorithm: Edit distance is typically computed with dynamic programming. The algorithm constructs a matrix in which each cell (i, j) holds the minimum number of operations required to convert the first i characters of one string into the first j characters of the other string. The value in the bottom-right cell of the matrix is the edit distance.
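
The matrix described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are assumptions, levenshtein allows insertions, deletions, and substitutions at cost 1 each, and lcs_edit_distance assumes the insertion/deletion-only variant, computed as |A| + |B| − 2 · LCS(A, B).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions, and substitutions (cost 1 each)."""
    # dist[i][j] = operations needed to turn the first i chars of a
    # into the first j chars of b.
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
    return dist[len(a)][len(b)]  # bottom-right cell


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def lcs_edit_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(levenshtein("kitten", "sitting"))        # 3
print(lcs_edit_distance("kitten", "sitting"))  # 5
```

The two results differ because a substitution counts as one operation in the first function but must be expressed as a deletion plus an insertion in the second.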

Use Cases:

Edit Distance is widely used in spell-checking, DNA sequence alignment, and text comparison, such as in plagiarism detection and natural language processing tasks like machine translation and speech recognition.

It can also be turned into a similarity score: the smaller the distance relative to the lengths of the strings, the more similar the strings are.

Qno:4 What are the main reasons to use Hamming distance and Edit distance?

ANSWER:

Hamming Distance and Edit Distance serve distinct purposes and are used in different contexts due to their specific characteristics:

Hamming Distance is primarily used when you want to measure the difference between two strings or sequences of equal length. Here are the main reasons to use Hamming Distance:

1. Error Detection and Correction: Hamming Distance is fundamental to error-detecting and error-correcting codes, such as Hamming codes. A code whose codewords are at least d positions apart can detect up to d − 1 errors and correct up to ⌊(d − 1) / 2⌋ errors in transmitted or stored data.

2. Bit-Level Comparison: Hamming Distance is ideal for comparing binary sequences of equal length, where each bit represents a specific attribute or condition. It's commonly used in digital communication and networking.

3. String Matching: When strings have equal length, Hamming Distance can find similar strings by comparing their characters at corresponding positions.

4. Genetics and DNA Analysis: Hamming Distance can be applied in genetics to measure the difference between aligned DNA sequences of equal length. Each differing position corresponds to a substitution (point mutation).

5. Hardware Testing: In hardware testing and circuit design, Hamming Distance is used to check the integrity of data during transfer by counting how many bits of the received data differ from the data that was sent.
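
For the bit-level uses listed above, a common way to compute the distance is to XOR the two values and count the 1 bits. The sketch below assumes non-negative integers; the name bit_hamming_distance and the sample bit patterns are made up for illustration.

```python
def bit_hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which two non-negative integers differ."""
    # XOR leaves a 1 exactly where the bits differ; counting the 1s gives the distance.
    return bin(x ^ y).count("1")


sent = 0b1011101
received = 0b1001001
print(bit_hamming_distance(sent, received))  # 2
```

A result of 0 means the received word matches the sent word bit for bit.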

On the other hand, Edit Distance (LCS-Based) is used when you want to measure the dissimilarity between two strings that may differ in length, by counting insertions, deletions, and (in the Levenshtein variant) substitutions. Here are the main reasons to use Edit Distance:


1. Text Comparison: Edit Distance is commonly used in natural language processing for comparing and aligning text documents. It's valuable in tasks like plagiarism detection and machine translation.

2. Spell Checking: Edit Distance helps suggest corrections for a misspelled word by finding the dictionary words with the smallest edit distance to it.

3. Bioinformatics: In bioinformatics, Edit Distance is used to compare DNA or protein sequences. It aids in sequence alignment and understanding genetic relationships.

4. OCR (Optical Character Recognition): Edit Distance can be employed in OCR systems to recognize and correct errors in scanned text.

5. Data Cleaning and Deduplication: Edit Distance can be used to clean and deduplicate data by identifying similar records with slight variations.
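
The spell-checking use can be sketched as a nearest-word lookup. This is a toy illustration only: the word list, the suggest helper, and the max_distance threshold of 2 are all assumptions, and the distance used is the Levenshtein variant (insertions, deletions, substitutions).

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming matrix."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(word: str, dictionary: list[str], max_distance: int = 2) -> list[str]:
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((levenshtein(word, candidate), candidate) for candidate in dictionary)
    return [candidate for distance, candidate in scored if distance <= max_distance]


# Hypothetical dictionary, for illustration only.
dictionary = ["distance", "instance", "science", "sequence", "similar"]
print(suggest("distanse", dictionary))  # ['distance']
```

The same lookup applies to deduplication: records whose keys fall within a small edit distance of each other are candidates for merging.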

Qno:5 Elaborate the Cosine Similarity with suitable examples.

ANSWER:

Cosine Similarity is a metric used to measure the similarity between two vectors in a multidimensional space, often in the context of information retrieval, text analysis, and recommendation systems. It is the cosine of the angle between the two vectors, so it ranges from -1 (the vectors point in opposite directions) through 0 (the vectors are orthogonal, indicating no similarity) to 1 (the vectors point in the same direction). For vectors with no negative components, such as word counts, the value always lies between 0 and 1.

Here's a detailed explanation of Cosine Similarity with suitable examples:

Concept:
Imagine you have two vectors, A and B, in a multi-dimensional space. Each dimension represents a different attribute or feature, and the values in each dimension represent the magnitude of that feature for the respective vector.

Cosine Similarity measures the cosine of the angle θ between these two vectors, which is calculated as follows:

Cosine Similarity = cos(θ) = A · B / (‖A‖ * ‖B‖)

A · B represents the dot product of vectors A and B.

‖A‖ represents the magnitude (Euclidean length) of vector A.

‖B‖ represents the magnitude (Euclidean length) of vector B.
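
The formula translates directly into code. The sketch below uses plain Python lists; the function name cosine_similarity and the choice to raise an error for a zero vector (where the angle is undefined) are assumptions made for this illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))     # A · B
    norm_a = math.sqrt(sum(x * x for x in a))  # ‖A‖
    norm_b = math.sqrt(sum(y * y for y in b))  # ‖B‖
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))    # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))    # same direction, different length
print(cosine_similarity([1, 2], [-1, -2]))  # opposite directions
```

The three calls correspond to the values 0, 1, and -1 up to floating-point rounding, which also shows that only the direction of the vectors matters, not their length.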

Example:

Let's say you're working on a text analysis task where you want to compare the similarity between two documents represented as vectors in a high-dimensional space. Each dimension could represent the frequency of a specific word in the documents.

Document A: "The quick brown fox jumps over the lazy dog."

Document B: "A fast brown fox leaps over a sleeping canine."

1. Vector Representation:
Convert each document into a vector where each dimension corresponds to the frequency of a specific word, ignoring case and punctuation. For simplicity, let's use only the words that occur in the two documents as the vocabulary: {quick, brown, fox, jumps, over, the, lazy, dog, fast, leaps, a, sleeping, canine}.

Calculate the word frequencies in each document:

Vector A = [1, 1, 1, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0]

Vector B = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 2, 1, 1]

2. Cosine Similarity Calculation:

• Calculate the dot product of the two vectors (A · B).

A · B = (1 * 0) + (1 * 1) + (1 * 1) + (1 * 0) + (1 * 1) + (2 * 0) + (1 * 0) + (1 * 0) + (0 * 1) + (0 * 1) + (0 * 2) + (0 * 1) + (0 * 1) = 3

• Calculate the magnitude (Euclidean) of each vector (‖A‖ and ‖B‖).

‖A‖ = sqrt((1^2) + (1^2) + (1^2) + (1^2) + (1^2) + (2^2) + (1^2) + (1^2) + (0^2) + (0^2) + (0^2) + (0^2) + (0^2)) = sqrt(11) ≈ 3.32

‖B‖ = sqrt((0^2) + (1^2) + (1^2) + (0^2) + (1^2) + (0^2) + (0^2) + (0^2) + (1^2) + (1^2) + (2^2) + (1^2) + (1^2)) = sqrt(11) ≈ 3.32

• Calculate the Cosine Similarity:

Cosine Similarity = A · B / (‖A‖ * ‖B‖) = 3 / (sqrt(11) * sqrt(11)) = 3 / 11 ≈ 0.27

3. Interpretation:
The Cosine Similarity value of approximately 0.27 suggests that the two documents, A and B, are somewhat similar but not highly similar: they share only the words brown, fox, and over. Because word-frequency vectors have no negative components, the value here must fall between 0 (no words in common) and 1 (identical word proportions). Word counts do not capture synonyms such as quick/fast or dog/canine, which is why two sentences with nearly the same meaning score this low.

In this example, Cosine Similarity is used to compare the similarity between two documents based on the frequency of words they share. Higher values of Cosine Similarity indicate greater similarity, which makes it a useful measure for applications like document retrieval and recommendation systems.
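
The worked example can be reproduced with a short script. This is a sketch under stated assumptions: words are lowercased and punctuation is dropped before counting, the vocabulary is simply every word that appears in either document, and the helper names word_counts and document_similarity are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count words, ignoring case and punctuation ("The" and "the" are the same word)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def document_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-frequency vectors of two texts."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocabulary = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for words it has not seen, which fills in the zero entries.
    vector_a = [counts_a[word] for word in vocabulary]
    vector_b = [counts_b[word] for word in vocabulary]
    dot = sum(x * y for x, y in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(x * x for x in vector_a))
    norm_b = math.sqrt(sum(y * y for y in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty document as having no similarity
    return dot / (norm_a * norm_b)


doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "A fast brown fox leaps over a sleeping canine."
print(round(document_similarity(doc_a, doc_b), 2))  # 0.27
```

The order of the vocabulary does not affect the result, as long as both vectors use the same order.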


Qno:1 Compute the distance between Row 2 and Row 3 of the following table by using the Euclidean and Manhattan distance formulas.

         Feature 1    Feature 2
Row 1    10           3
Row 2    5            4
Row 3    3            2

Qno:2 Compute the distance between Row 1 and Row 2 of the following table by using the Euclidean and Manhattan distance formulas.

         Feature 1    Feature 2    Feature 3
Row 1    10           3            3
Row 2    5            4            5
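
One way to check the two computations is sketched below, assuming the standard definitions: Euclidean distance is the square root of the sum of squared feature differences, and Manhattan distance is the sum of absolute feature differences. The function names are illustrative.

```python
import math


def euclidean(p: list[float], q: list[float]) -> float:
    """Square root of the sum of squared differences between features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p: list[float], q: list[float]) -> float:
    """Sum of absolute differences between features."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Qno:1, Row 2 = (5, 4) and Row 3 = (3, 2)
print(round(euclidean([5, 4], [3, 2]), 2))  # 2.83, i.e. sqrt(8)
print(manhattan([5, 4], [3, 2]))            # 4

# Qno:2, Row 1 = (10, 3, 3) and Row 2 = (5, 4, 5)
print(round(euclidean([10, 3, 3], [5, 4, 5]), 2))  # 5.48, i.e. sqrt(30)
print(manhattan([10, 3, 3], [5, 4, 5]))            # 8
```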

