Assignment No 1 (Data Science) - Ashber

Uploaded by

Ashber Ur Rehman Khan

Assignment No:1

Subject: Introduction to Data Science

Name: ASHBER KHAN

Std_ID: 20211-30975

Qno:3 Explain the Hamming distance and Edit distance (LCS-based).

ANSWER:

Hamming Distance and Edit Distance (Longest Common Subsequence-Based) are two measures used in computer science and information theory to quantify the dissimilarity between two strings or sequences.

Hamming Distance:

Definition: Hamming Distance is the number of positions at which the corresponding symbols of two strings differ. It is defined only for strings of equal length.

Formula: For two strings A and B of equal length n, the Hamming Distance H is calculated as:

H(A, B) = Σᵢ₌₁ⁿ [A[i] ≠ B[i]]

where [A[i] ≠ B[i]] is 1 if the symbols at position i differ and 0 otherwise. For example, "karolin" and "kathrin" differ at positions 3, 4, and 5, so their Hamming Distance is 3.
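The definition can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation, and the function name `hamming_distance` is chosen here for the example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        # The measure is undefined when the lengths differ.
        raise ValueError("Hamming distance requires sequences of equal length")
    # Each comparison yields True (1) or False (0); summing counts mismatches.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_distance("1011101", "1001001"))  # 2
```

Raising an error for unequal lengths, instead of silently truncating to the shorter string, keeps the function faithful to the definition.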

Edit Distance (LCS-Based):

Definition: Edit Distance is the minimum number of single-character operations required to transform one string into another. The best-known variant, Levenshtein Distance, allows insertions, deletions, and substitutions. The LCS-based variant allows only insertions and deletions, so replacing a character costs two operations (one deletion and one insertion). For strings A and B it equals |A| + |B| - 2 * LCS(A, B), where LCS(A, B) is the length of their longest common subsequence.

Algorithm: Both variants are usually computed with dynamic programming. The algorithm fills a matrix in which cell (i, j) holds the minimum number of operations required to convert the first i characters of one string into the first j characters of the other string. The value in the bottom-right cell of the matrix is the edit distance.
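The matrix-filling procedure can be sketched as follows. This is an illustrative Python version under the usual unit-cost assumption (every operation costs 1). The names `levenshtein`, `lcs_length`, and `lcs_distance` are chosen for this example. The second distance permits only insertions and deletions, which is the LCS-based form of edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions, and substitutions."""
    m, n = len(a), len(b)
    # dp[i][j] = operations needed to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[m][n]  # bottom-right cell


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcs_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(levenshtein("kitten", "sitting"))   # 3
print(lcs_distance("kitten", "sitting"))  # 5
```

For "kitten" and "sitting", the Levenshtein version needs two substitutions and one insertion. Without substitutions, each replacement becomes a deletion plus an insertion, which raises the count to 5.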

Use Cases:

Edit Distance is widely used in spell-checking, DNA sequence alignment, and text comparison, for example in plagiarism detection and in natural language processing tasks such as machine translation and speech recognition.

It also serves as a similarity measure: the smaller the distance relative to the lengths of the two strings, the more similar they are.

Qno:4 What are the main reasons to use Hamming distance and Edit distance?

ANSWER:

Hamming Distance and Edit Distance serve distinct purposes and are used in different contexts due to their specific characteristics:

Hamming Distance is primarily used when you want to measure the difference between two strings or sequences of equal length. Here are the main reasons to use Hamming Distance:

1. Error Detection and Correction: Hamming Distance underlies error-detecting and error-correcting codes such as Hamming codes. A code whose codewords are at least d positions apart can detect up to d - 1 bit errors, or correct up to floor((d - 1) / 2) of them, in transmitted or stored data.

2. Bit-Level Comparison: Hamming Distance is well suited to comparing binary sequences in which each bit represents a specific attribute or condition. It is commonly used in digital communication and networking.

3. String Matching: When strings have equal length, Hamming Distance finds similar strings by comparing their characters at corresponding positions.

4. Genetics and DNA Analysis: Hamming Distance can measure the difference between aligned DNA sequences of equal length. It counts point mutations (substitutions) but cannot account for insertions or deletions.

5. Hardware Testing: In hardware testing and circuit design, Hamming Distance is used to check the integrity of data during transfer.
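To make the error-correction use concrete, here is a small Python sketch of nearest-codeword decoding. The 3-bit repetition code (`"000"` and `"111"`) and the helper names are assumptions chosen for illustration. Real Hamming codes use longer codewords and parity-check arithmetic.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords: list[str]) -> int:
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(x, y) for x, y in combinations(codewords, 2))


def decode(received: str, codewords: list[str]) -> str:
    """Return the codeword closest to the received word."""
    return min(codewords, key=lambda c: hamming_distance(received, c))


# Hypothetical example code: each bit is sent three times.
CODEWORDS = ["000", "111"]

print(minimum_distance(CODEWORDS))  # 3
print(decode("010", CODEWORDS))     # 000
print(decode("110", CODEWORDS))     # 111
```

With a minimum distance of 3, a single flipped bit still leaves the received word closer to the original codeword than to any other, which is why one error can be corrected.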

Edit Distance, on the other hand, is used when the two strings may differ in length. It measures their dissimilarity by counting insertions and deletions and, in the Levenshtein variant, substitutions. Here are the main reasons to use Edit Distance:

1. Text Comparison: Edit Distance is commonly used in natural language processing to compare and align texts. It is valuable in tasks such as plagiarism detection and machine translation.

2. Spell Checking: Edit Distance helps suggest corrections for a misspelled word by finding the dictionary words closest to it.

3. Bioinformatics: In bioinformatics, Edit Distance is used to compare DNA or protein sequences. It aids in sequence alignment and in understanding genetic relationships.

4. OCR (Optical Character Recognition): Edit Distance can be employed in OCR systems to detect and correct errors in scanned text.

5. Data Cleaning and Deduplication: Edit Distance can be used to clean and deduplicate data by identifying records that differ only slightly.
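The spell-checking use can be sketched in Python as a ranking of dictionary words by their distance from the misspelled word. The word list, the `max_distance` threshold, and the function names are hypothetical choices for this example. A real spell-checker would use a far larger dictionary and faster lookup structures.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance computed with two rows of the DP matrix."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def suggest(word: str, dictionary: list[str], max_distance: int = 2):
    """Return (distance, candidate) pairs within max_distance, closest first."""
    scored = [(levenshtein(word, cand), cand) for cand in dictionary]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# Hypothetical mini-dictionary for illustration only.
WORDS = ["spelling", "spending", "sampling"]

print(suggest("speling", WORDS))  # [(1, 'spelling'), (2, 'spending')]
```

Keeping only two rows reduces memory use from O(m * n) to O(n) while producing the same distance.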

Qno:5 Elaborate the Cosine Similarity with suitable examples.

ANSWER:

Cosine Similarity is a metric that measures the similarity between two vectors in a multidimensional space. It is widely used in information retrieval, text analysis, and recommendation systems. It is the cosine of the angle between the two vectors and ranges from -1 (the vectors point in opposite directions) through 0 (the vectors are orthogonal, indicating no similarity) to 1 (the vectors point in the same direction). For vectors with no negative components, such as word counts, the value lies between 0 and 1.

Here's a detailed explanation of Cosine Similarity with a suitable example:

Concept:

Imagine you have two vectors, A and B, in a multi-dimensional space. Each dimension represents a different attribute or feature, and the value in each dimension is the magnitude of that feature for the respective vector.

Cosine Similarity is the cosine of the angle θ between these two vectors, calculated as follows:

Cosine Similarity Formula:

cos(θ) = (A · B) / (‖A‖ * ‖B‖)

where:

A · B represents the dot product of vectors A and B.

‖A‖ represents the magnitude (length) of vector A.

‖B‖ represents the magnitude (length) of vector B.
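A minimal Python sketch of this calculation, using only the standard library, is shown below. The function name `cosine_similarity` and the two-dimensional test vectors are illustrative assumptions. The three calls show the extreme and middle values of the range.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

Because the result depends only on the angle, scaling either vector by a positive constant leaves the similarity unchanged.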

Example:

Let's say you're working on a text analysis task where you want to compare two documents represented as vectors in a high-dimensional space. Each dimension holds the frequency of a specific word in the document.

Document A: "The quick brown fox jumps over the lazy dog."

Document B: "A fast brown fox leaps over a sleeping canine."

1. Vector Representation:

Convert each document into a vector in which each dimension corresponds to the frequency of a specific word, ignoring case and punctuation. The vocabulary is the set of words that appear in either document: {quick, brown, fox, jumps, over, the, lazy, dog, fast, leaps, a, sleeping, canine}.

Calculate the word frequencies in each document:

Vector A = [1, 1, 1, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0]

Vector B = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 2, 1, 1]

2. Cosine Similarity Calculation:

• Calculate the dot product of the two vectors (A · B).

A · B = (1 * 0) + (1 * 1) + (1 * 1) + (1 * 0) + (1 * 1) + (2 * 0) + (1 * 0) + (1 * 0) + (0 * 1) + (0 * 1) + (0 * 2) + (0 * 1) + (0 * 1) = 3

• Calculate the magnitude (Euclidean norm) of each vector (‖A‖ and ‖B‖).

‖A‖ = sqrt((1^2) + (1^2) + (1^2) + (1^2) + (1^2) + (2^2) + (1^2) + (1^2) + (0^2) + (0^2) + (0^2) + (0^2) + (0^2)) = sqrt(11) ≈ 3.32

‖B‖ = sqrt((0^2) + (1^2) + (1^2) + (0^2) + (1^2) + (0^2) + (0^2) + (0^2) + (1^2) + (1^2) + (2^2) + (1^2) + (1^2)) = sqrt(11) ≈ 3.32

• Calculate the Cosine Similarity:

Cosine Similarity = A · B / (‖A‖ * ‖B‖) = 3 / (sqrt(11) * sqrt(11)) = 3 / 11 ≈ 0.27

3. Interpretation:

The Cosine Similarity value of approximately 0.27 suggests that the two documents, A and B, are somewhat similar but not highly similar. Because word-frequency vectors have no negative components, the value must lie between 0 (no words in common) and 1 (identical word proportions).

In this example, Cosine Similarity compares two documents based on the frequency of the words they share. Higher values indicate greater similarity, which makes it a useful measure for applications such as document retrieval and recommendation systems.
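The document example can be reproduced with a short Python sketch that counts words directly from the two sentences. The tokenizer (lowercase, letters only) and the function names are simplifying assumptions for illustration. Counting this way gives "dog" a frequency of 0 in Document B, a dot product of 3, and a magnitude of sqrt(11) for both vectors.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count runs of letters as words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse word-count vectors."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an empty document")
    return dot / (norm_a * norm_b)


doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "A fast brown fox leaps over a sleeping canine."

counts_a = word_counts(doc_a)
counts_b = word_counts(doc_b)

print(sorted(counts_a.keys() & counts_b.keys()))  # ['brown', 'fox', 'over']
print(round(cosine_similarity(counts_a, counts_b), 2))  # 0.27
```

Storing counts in a `Counter` avoids building the full 13-dimensional vectors by hand, since words missing from a document simply count as 0.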

Qno:1 Compute the distance between Row 2 and Row 3 of the following table by using the Euclidean and Manhattan distance formula.

         Feature 1   Feature 2
Row 1    10          3
Row 2    5           4
Row 3    3           2

Qno:2 Compute the distance between Row 1 and Row 2 of the following table by using the Euclidean and Manhattan distance formula.

         Feature 1   Feature 2   Feature 3
Row 1    10          3           3
Row 2    5           4           5
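As a hedged sketch of how these two questions could be worked, the Python below applies the standard formulas: Euclidean distance is the square root of the summed squared differences, and Manhattan distance is the sum of the absolute differences. The function names are illustrative. The row values are taken from the two tables above.

```python
import math


def euclidean(p, q) -> float:
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))


# Qno:1, Row 2 and Row 3 of the two-feature table
q1_row2, q1_row3 = (5, 4), (3, 2)
print(round(euclidean(q1_row2, q1_row3), 3))  # 2.828  (sqrt(8))
print(manhattan(q1_row2, q1_row3))            # 4

# Qno:2, Row 1 and Row 2 of the three-feature table
q2_row1, q2_row2 = (10, 3, 3), (5, 4, 5)
print(round(euclidean(q2_row1, q2_row2), 3))  # 5.477  (sqrt(30))
print(manhattan(q2_row1, q2_row2))            # 8
```

The same two functions work for any number of features, since they iterate over the coordinates pairwise.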

Qno:3 Explain the Hamming distance and Edit distance (LCS-based).

Qno:4 What are the main reasons to use Hamming distance and Edit distance?
Qno:5 Elaborate the Cosine Similarity with suitable examples.
