
IR Journal

Information Retrieval (University of Mumbai)


Practical No: 1
Aim: Document Indexing and Retrieval
● Implement an inverted index construction algorithm.
● Build a simple document retrieval system using the
constructed index.

Theory :
● An Inverted Index is a data structure used in information retrieval
systems to efficiently retrieve documents or web pages containing
a specific term or set of terms.
● In an inverted index, the index is organised by terms (words), and
each term points to a list of documents or web pages that contain that
term.
● Inverted indexes are widely used in search engines, database systems,
and other applications where efficient text search is required.
● They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow. An
inverted index is an index data structure storing a mapping from
content, such as words or numbers, to its locations in a document or a
set of documents.

Rules to create an inverted index -


1) The text of each document is first preprocessed by removing stop words :
Stop words are the most occurring and useless words in documents like
“I”, “the”, “we”, “is”, and “an”.
2) The text is tokenized, meaning that it is split into individual terms.
3) The terms are then added to the index, with each term pointing to
the documents in which it appears.

Practical:
Input:
import nltk # Import NLTK to download stopwords
from nltk.corpus import stopwords # Import stopwords from NLTK

# Define the documents


document1 = "The quick brown fox jumped over the lazy dog"
document2 = "The lazy dog slept in the sun"


# Get the stopwords for English language from NLTK


nltk.download('stopwords')
stopWords = stopwords.words('english')

# Step 1: Tokenize the documents


# Convert each document to lowercase and split it into words
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()

# Combine the tokens into a list of unique terms


terms = list(set(tokens1 + tokens2))

# Step 2: Build the inverted index


# Create an empty dictionary to store the inverted index, plus dictionaries
# to store the number of occurrences of each term per document
inverted_index = {}
occ_num_doc1 = {}
occ_num_doc2 = {}

# For each term, find the documents that contain it


for term in terms:
    if term in stopWords:
        continue
    documents = []
    if term in tokens1:
        documents.append("Document 1")
        occ_num_doc1[term] = tokens1.count(term)
    if term in tokens2:
        documents.append("Document 2")
        occ_num_doc2[term] = tokens2.count(term)
    inverted_index[term] = documents

# Step 3: Print the inverted index

for term, documents in inverted_index.items():
    print(term, "->", end=" ")
    for doc in documents:
        if doc == "Document 1":
            print(f"{doc} ({occ_num_doc1.get(term, 0)}),", end=" ")
        else:
            print(f"{doc} ({occ_num_doc2.get(term, 0)}),", end=" ")
    print()
print("Performed by 740_Pallavi & 743_Deepak")
Output:


Practical No: 2
Aim: Retrieval Models
● Implement the Boolean retrieval model and process queries.
● Implement the vector space model with TF-IDF weighting and
cosine similarity.

Theory :
A) Boolean Retrieval Model -
● A Boolean model is a fundamental concept in Information Retrieval (IR) that is used to represent and retrieve documents or information based on Boolean logic.
● In this model, a document is typically represented as a set of terms (words or phrases), and queries are expressed using Boolean operators (AND, OR, NOT) to specify the desired information.

Here's how the Boolean model works in IR:

1. Document Representation: Each document in the collection is represented as a set of terms. These terms are extracted from the document's content and can be single words, phrases, or other units of information.

2. Query Representation: Queries are also represented as sets of terms, and Boolean operators (AND, OR, NOT) are used to combine these terms to express the user's information need. For example, a query might be "cats AND dogs," meaning the user wants documents that contain both "cats" and "dogs."

3. Boolean Operators:
● AND: for "cats AND dogs," only documents containing both "cats" and "dogs" will be retrieved.
● OR: for "cats OR dogs," documents containing "cats" or "dogs" (or both) will be retrieved.
● NOT: for "cats NOT dogs," documents containing "cats" but not "dogs" will be retrieved.


B) TF-IDF
● Term Frequency - Inverse Document Frequency (TF-IDF) is
a widely used statistical method in information retrieval.
● It measures how important a term is within a
document relative to a collection of documents.

Term Frequency (TF): The TF of a term is the number of times the term appears in a document divided by the total number of words in the document.

Inverse Document Frequency (IDF):

● The IDF of a term reflects the proportion of documents in the corpus that contain the term.
IDF = log( N / df ) where,
N= total no. of documents
df = no. of documents containing a term

● The TF-IDF of a term is calculated by multiplying its TF and IDF scores:
TF-IDF = TF * IDF
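To make these definitions concrete, here is a small hand-computed sketch on an assumed toy corpus (not part of the original practical); it simply applies TF = count / length and IDF = log(N / df) as defined above:

import math

# Assumed toy corpus of three tokenized documents
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "cats and dogs".split(),
]

N = len(docs)
term = "sat"
doc = docs[0]

tf = doc.count(term) / len(doc)          # TF = occurrences of the term / total words in the document
df = sum(1 for d in docs if term in d)   # number of documents containing the term
idf = math.log(N / df)                   # IDF = log(N / df)
print("TF-IDF of", term, "in doc 1:", round(tf * idf, 4))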

C) Cosine Similarity -

● Cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space.
● Cosine similarity is the cosine of the angle between the vectors.
● The cosine similarity always belongs to the interval [−1,1].
● In cosine similarity, data objects in a dataset are treated as vectors.
● The formula to find the cosine similarity between two vectors is:
cos(A, B) = (A . B) / (||A|| * ||B||)


Here A . B is the dot product of the vectors, and ||A|| and ||B|| are their magnitudes.
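A quick numeric check of the formula with NumPy (the two vectors are arbitrary illustrative values):

import numpy as np
from numpy.linalg import norm

A = np.array([1.0, 2.0, 0.0])
B = np.array([2.0, 1.0, 1.0])

# cos(A, B) = (A . B) / (||A|| * ||B||)
cosine = np.dot(A, B) / (norm(A) * norm(B))
print(round(cosine, 3))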

Practical:

A) Implement the Boolean retrieval model and process queries:

Input:

documents = {
1: "apple banana orange",
2: "apple banana",
3: "banana orange",
4: "apple"
}

# Function to build an inverted index using dictionaries


def build_index(docs):
    index = {}  # Initialize an empty dictionary to store the inverted index
    for doc_id, text in docs.items():  # Iterate through each document and its text
        terms = set(text.split())  # Split the text into individual terms
        for term in terms:  # Iterate through each term in the document
            if term not in index:
                index[term] = {doc_id}  # If the term is not in the index, create a new set with the document ID
            else:
                index[term].add(doc_id)  # If the term exists, add the document ID to its set
    return index  # Return the built inverted index


# Building the inverted index


inverted_index = build_index(documents)

# Function for Boolean AND operation using inverted index


def boolean_and(operands, index):
    if not operands:  # If there are no operands, return all document IDs
        return list(range(1, len(documents) + 1))

    result = index.get(operands[0], set())  # Get the set of document IDs for the first operand
    for term in operands[1:]:  # Iterate through the rest of the operands
        result = result.intersection(index.get(term, set()))  # Intersect with the sets of document IDs
    return list(result)  # Return the resulting list of document IDs

# Function for Boolean OR operation using inverted index


def boolean_or(operands, index):
    result = set()  # Initialize an empty set to store the resulting document IDs
    for term in operands:  # Iterate through each term in the query
        result = result.union(index.get(term, set()))  # Union of the sets of document IDs for each term
    return list(result)  # Return the resulting list of document IDs

# Function for Boolean NOT operation using inverted index


def boolean_not(operand, index, total_docs):
    operand_set = set(index.get(operand, set()))  # Get the set of document IDs for the operand
    all_docs_set = set(range(1, total_docs + 1))  # Create a set of all document IDs
    return list(all_docs_set.difference(operand_set))  # Return documents not in the operand set

# Example queries
query1 = ["apple", "banana"]  # Query for documents containing both "apple" and "banana"
query2 = ["apple", "orange"]  # Query for documents containing "apple" or "orange"

# Performing Boolean Model queries using the inverted index

result1 = boolean_and(query1, inverted_index)  # Get documents containing both terms
result2 = boolean_or(query2, inverted_index)  # Get documents containing either of the terms
result3 = boolean_not("orange", inverted_index, len(documents))  # Get documents not containing "orange"

# Printing results
print("Documents containing 'apple' and 'banana':", result1)
print("Documents containing 'apple' or 'orange':", result2)
print("Documents not containing 'orange':", result3)
print("Performed by 740_Pallavi & 743_Deepak")

Output:


B) Implement the vector space model with TF-IDF weighting and cosine similarity:

Input:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer  # Import necessary libraries
import nltk  # Import NLTK to download stopwords
from nltk.corpus import stopwords  # Import stopwords from NLTK
import numpy as np  # Import NumPy library
from numpy.linalg import norm  # Import the norm function from NumPy's linear algebra module

# Define the training and test sets of text documents


train_set = ["The sky is blue.", "The sun is bright."] # Documents
test_set = ["The sun in the sky is bright."] # Query

# Get the stopwords for English language from NLTK


nltk.download('stopwords')
stopWords = stopwords.words('english')

# Initialize CountVectorizer and TfidfTransformer objects

vectorizer = CountVectorizer(stop_words=stopWords)  # CountVectorizer converts text to a matrix of token counts
transformer = TfidfTransformer()  # TfidfTransformer converts the matrix of token counts to a TF-IDF representation

# Convert the training and test sets to arrays of token counts

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()  # Fit-transform the training set
testVectorizerArray = vectorizer.transform(test_set).toarray()  # Transform the test set

# Display the TF-IDF arrays for training and test sets


print('Fit Vectorizer to train set', trainVectorizerArray)


print('Transform Vectorizer to test set', testVectorizerArray)

# Define a lambda function to calculate cosine similarity between vectors


cx = lambda a, b: round(np.inner(a, b) / (norm(a) * norm(b)), 3)

# Iterate through each vector in the training set


for vector in trainVectorizerArray:
    print(vector)  # Display each vector in the training set
    # Iterate through each vector in the test set
    for testV in testVectorizerArray:
        print(testV)  # Display each vector in the test set
        cosine = cx(vector, testV)  # Calculate cosine similarity between the vectors
        print(cosine)  # Display the cosine similarity

# Fit the transformer to the training set and transform it to a TF-IDF representation
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

# Fit the transformer to the test set and transform it to TF-IDF representation
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())


Output:


Practical No: 3

Aim: Spelling Correction in IR Systems


● Develop a spelling correction module using edit distance algorithms.
● Integrate the spelling correction module into an information
retrieval system.

Theory:

Edit Distance :

● Edit distance is a measure of the similarity between two strings: it is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
● The smaller the edit distance, the more similar the strings are.

Consider two strings str1 and str2 of length M and N respectively.

To find the edit distance, the following operations are performed (a dynamic-programming sketch and a toy spelling-correction example follow this list):

1. Operation 1 (INSERT): insert any character before or after any index value.
2. Operation 2 (REMOVE): remove a character.
3. Operation 3 (REPLACE): replace the character at any index value with some other character.
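The practical below uses a naive recursion, which recomputes the same subproblems many times. As a sketch (not part of the original journal), the same three operations can be filled into a dynamic-programming table, and a toy spelling-correction step can then suggest the dictionary word with the smallest edit distance; the word list used here is an assumption:

def edit_distance_dp(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j                          # insert all remaining characters of str2
            elif j == 0:
                dp[i][j] = i                          # remove all remaining characters of str1
            elif str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # last characters match, no edit needed
            else:
                dp[i][j] = 1 + min(dp[i][j - 1],      # insert
                                   dp[i - 1][j],      # remove
                                   dp[i - 1][j - 1])  # replace
    return dp[m][n]

# Toy spelling correction: suggest the vocabulary word closest to a misspelled query term
vocabulary = ["information", "retrieval", "boolean", "index"]  # assumed word list
query = "retreival"
print(min(vocabulary, key=lambda w: edit_distance_dp(query, w)))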

Practical:
Input:
# A naive recursive Python program to find the minimum number of
# operations needed to convert str1 to str2
def editDistance(str1, str2, m, n):
    # If the first string is empty, the only option is to insert all characters of the second string
    if m == 0:
        return n
    # If the second string is empty, the only option is to remove all characters of the first string
    if n == 0:
        return m
    # If the last characters of the two strings are the same, ignore them
    # and get the count for the remaining strings
    if str1[m-1] == str2[n-1]:
        return editDistance(str1, str2, m-1, n-1)
    # If the last characters differ, consider all three operations on the last character of the
    # first string, recursively compute the minimum cost of each, and take the minimum of the three
    return 1 + min(editDistance(str1, str2, m, n-1),    # Insert
                   editDistance(str1, str2, m-1, n),    # Remove
                   editDistance(str1, str2, m-1, n-1))  # Replace
# Driver code
str1 = "sunday"
str2 = "saturday"
print('Edit Distance is: ', editDistance(str1, str2, len(str1), len(str2)))

Output:


Practical No: 4

Aim: Evaluation Metrics for IR Systems


A) Calculate precision, recall, and F-measure for a given set of
retrieval results.
B) Use an evaluation toolkit to measure average precision and
other evaluation metrics.

Theory:
1. Precision:
● Precision is the ratio of correctly predicted positive observations
to the total predicted positives.
● It is also called Positive Predictive Value (PPV).
● Precision is calculated using the following formula:
Precision = TP / (TP + FP)
Where:
• TP (True Positives) is the number of instances correctly predicted as positive.
• FP (False Positives) is the number of instances incorrectly predicted as positive.
High precision indicates that the model has a low rate of false positives. In other words, when the model predicts a positive result, it is likely to be correct.

2. Recall:
• Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
• Recall is calculated using the following formula:
Recall = TP / (TP + FN)
Where:
• TP (True Positives) is the number of instances correctly predicted as positive.
• FN (False Negatives) is the number of instances incorrectly predicted as negative.
High recall indicates that the model has a low rate of false negatives. In other words, the model is effective at capturing all the positive instances.

3. F-measure:
• The F-measure is a metric commonly used in performance evaluation.
• It combines precision and recall into a single value, providing
a balanced measure of a model's performance.

• The formula for F-measure is:
F-measure = 2 * (Precision * Recall) / (Precision + Recall)
• The F-measure ranges from 0 to 1, where 1 indicates perfect precision and recall.

4. Average Precision:
Average Precision summarises the precision of a ranked result list, based on the relevance of each result.

Algorithm (a worked sketch follows this list):

1) Take two variables X and Y, both initialised to 0.
2) Go through the ranked predictions from left to right.
3) If a prediction is 0 (not relevant), increment only Y by 1 and do not record a precision score.
4) If a prediction is 1 (relevant), increment both X and Y by 1.
5) After incrementing, compute X / Y as the precision score at the current position.
6) Finally, sum all the recorded precision scores and divide by the total number of positive (relevant) predictions.
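A hand-rolled sketch of this algorithm on an assumed ranked relevance list (1 = relevant, 0 = not relevant); it follows the X/Y steps above and is only an illustration of the idea used by the toolkit in part B:

def average_precision(ranked_relevance):
    X, Y = 0, 0          # X = relevant results seen so far, Y = results seen so far
    scores = []
    for rel in ranked_relevance:
        Y += 1
        if rel == 1:
            X += 1
            scores.append(X / Y)   # precision at this position
    return sum(scores) / X if X else 0.0

# Assumed ranked results, most confident first
print(average_precision([1, 0, 1, 1, 0, 1]))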


A) Calculate precision, recall, and F-measure for a given set of retrieval results.

Input:

def calculate_metrics(retrieved_set, relevant_set):
    true_positive = len(retrieved_set.intersection(relevant_set))
    false_positive = len(retrieved_set.difference(relevant_set))
    false_negative = len(relevant_set.difference(retrieved_set))

    '''
    (Optional) PPT values:
    true_positive = 20
    false_positive = 10
    false_negative = 30
    '''

    print("True Positive: ", true_positive,
          "\nFalse Positive: ", false_positive,
          "\nFalse Negative: ", false_negative, "\n")

    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f_measure = 2 * precision * recall / (precision + recall)

    return precision, recall, f_measure

retrieved_set = set(["doc1", "doc2", "doc3"]) #Predicted set


relevant_set = set(["doc1", "doc4"]) #Actually Needed set (Relevant)

precision, recall, f_measure = calculate_metrics(retrieved_set, relevant_set)


print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-measure:
{f_measure}")

Output:

B) Use an evaluation toolkit to measure average precision and other evaluation metrics.

Input:

from sklearn.metrics import average_precision_score

y_true = [0, 1, 1, 0, 1, 1]  # Ground-truth binary relevance labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]  # Model's estimated relevance scores

average_precision = average_precision_score(y_true, y_scores)

print(f'Average precision-recall score: {average_precision}')


Output:


Practical No: 5

Aim: Text Categorization


A) Implement a text classification algorithm (e.g., Naive Bayes
or Support Vector Machines).
B) Train the classifier on a labelled dataset and evaluate
its performance.

Theory:
Naive Bayes
• The Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-
dimensional training dataset.
• The Naive Bayes classifier is one of the simplest and most effective classification algorithms, and it helps build fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Bayes' Theorem:

• Bayes' theorem is also known as Bayes' Rule or Bayes' Law. It is used to determine the probability of a hypothesis with prior knowledge, and it depends on conditional probability.

• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)
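A tiny numeric sketch of the formula (all probabilities here are assumed, purely for illustration): suppose 20% of emails are spam, 70% of spam emails contain the word "offer", and 10% of non-spam emails do.

# Assumed probabilities, for illustration only
p_spam = 0.2                # P(A): prior probability that an email is spam
p_offer_given_spam = 0.7    # P(B|A): probability of the word "offer" given spam
p_offer_given_ham = 0.1     # P(B|not A): probability of "offer" given not spam

# P(B): total probability of seeing the word "offer"
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(round(p_spam_given_offer, 3))   # about 0.636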

Create two CSV files: Dataset.csv (the labelled training data, with text columns "covid" and "fever" and a label column "flu") and Test.csv (the new data to classify, with columns "covid" and "fever").


Practical:

Input:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the CSV file


df = pd.read_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Dataset.csv")
data = df["covid"] + "" + df["fever"]
X = data.astype(str) # Test
data y = df['flu'] # Labels

# Splitting the data into training and test data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 42)

# Converting data into bag-of-words format to train the model

vectorizer = CountVectorizer()


# initializing the converter

X_train_counts = vectorizer.fit_transform(X_train)  # converting the training data
X_test_counts = vectorizer.transform(X_test)  # converting the test data

# using and training the multinomial model of the naive bayes algorithm
classifier = MultinomialNB()  # initializing the classifier
classifier.fit(X_train_counts, y_train)  # training the classifier

# loading another dataset to test if the model is working properly


data1 = pd.read_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Test.csv")
new_data = data1["covid"] + "" + data1["fever"]
new_data_counts = vectorizer.transform(new_data.astype(str)) #
converting the new data

# making the model to predict the results for new dataset


predictions = classifier.predict(new_data_counts)

# Output the results


new_data = predictions
print(new_data)

# retrieving the accuracy and classification report


accuracy = accuracy_score(y_test, classifier.predict(X_test_counts))
print(f"\nAccuracy: {accuracy:.2f}")

print("Classification Report: ")


print(classification_report(y_test, classifier.predict(X_test_counts)))

# Convert the predictions to a DataFrame


predictions_df = pd.DataFrame(predictions, columns = ['flu_prediction'])


# concatenate the original DataFrame with the predictions DataFrame


data1 = pd.concat([data1, predictions_df], axis = 1)

# write the DataFrame back to CSV
data1.to_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Test1.csv", index=False)

Output:


Practical No: 6

Aim: Clustering for Information Retrieval


Implement a clustering algorithm (e.g., K-means or
hierarchical clustering).
Apply the clustering algorithm to a set of documents and evaluate the
clustering results.

Theory:
K-Means Clustering:
● K-Means Clustering is an Unsupervised Learning algorithm,
which groups the unlabeled dataset into different clusters.
● Here K defines the number of predefined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for
K=3, there will be three clusters, and so on.
● It allows us to cluster the data into different groups and a convenient
way to discover the categories of groups in the unlabeled dataset on its
own without the need for any training.
● The main aim of this algorithm is to minimise the sum of
distances between the data point and their corresponding clusters.
● The algorithm takes the unlabeled dataset as input, divides the dataset
into k-number of clusters, and repeats the process until it does not find
the best clusters. The value of k should be predetermined in this
algorithm.

● The k-means clustering algorithm mainly performs two tasks:


1. Determines the best value for K centre points or centroids by
an iterative process.
2. Assigns each data point to its closest k-centre. Those data
points which are near to the particular k-centre, create a cluster.

Working of K-means Algorithm -


Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be different from the input
dataset).
Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.


Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready. (A small from-scratch sketch of these steps is shown below.)
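A minimal from-scratch sketch of these steps on a few assumed 2-D points (the data, K and the initial centroids are chosen by hand here; the practical below uses scikit-learn's KMeans instead):

import numpy as np

# Assumed toy data: a few 2-D points forming two loose groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])
K = 2

centroids = X[[0, 3]].copy()   # Step 2: pick K initial centroids (chosen by hand here)

for _ in range(10):            # repeat Steps 3-6
    # Step 3: assign each point to its closest centroid
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
    # Step 4: place the new centroid of each cluster at the mean of its points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 6: stop when no centroid moves any more
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)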

Practical

Input:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Cats are known for their agility and grace", #cat doc1
"Dogs are often called ‘man’s best friend’.", #dog doc1
"Some dogs are trained to assist people with disabilities.", #dog doc2
"The sun rises in the east and sets in the west.", #sun doc1
"Many cats enjoy climbing trees and chasing toys.", #cat doc2
]

# Create a TfidfVectorizer object


vectorizer = TfidfVectorizer(stop_words='english')

# Learn vocabulary and idf from training set.


X = vectorizer.fit_transform(documents)

# Perform k-means clustering


kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

# Print cluster labels for each document


print(kmeans.labels_)

Output:

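As a small extension for the second part of the aim, evaluating the clustering results, one common option (not in the original journal) is the silhouette score, which lies in [-1, 1] and is higher when clusters are compact and well separated. The snippet below reuses X and kmeans from the practical above:

from sklearn.metrics import silhouette_score

# X and kmeans come from the practical above (TF-IDF matrix and fitted model)
score = silhouette_score(X, kmeans.labels_)
print(f"Silhouette score: {score:.3f}")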

Practical No: 7

Aim: Web Crawling and Indexing


A) Develop a web crawler to fetch and index web pages.
B) Handle challenges such as robots.txt, dynamic content, and
crawling delays.

Theory

Crawling: Google downloads text, images, and videos from pages it found on
the internet with automated programs called crawlers.

Indexing: Google analyses the text, images, and video files on the page, and
stores the information in the Google index, which is a large database.

Crawling Process -
1) Starting Point: The crawling process usually begins with a set of seed
URLs, which can be provided manually or generated through
algorithms. These URLs serve as the starting points for the web
crawlers.

2) Queue of URLs: Web crawlers maintain a queue of URLs, often referred to as the URL frontier. This queue represents the set of URLs that the crawler is yet to visit. New URLs are continuously added to this queue during the crawling process.

3) Parsing Content: When a web crawler visits a webpage, it parses the HTML content to extract links (URLs) embedded within the page. This process involves examining HTML tags and, in some cases, executing JavaScript to discover additional links.

4) Respecting Directives: Crawlers adhere to rules specified in the robots.txt file, which indicates areas of a website that should not be crawled. Additionally, crawlers may implement policies to filter out certain types of content or URLs.


5) Avoiding Redundancy: To prevent redundancy and ensure efficient crawling, duplicate URLs are often identified and removed from the crawling queue.

6) Traversal Strategy: Crawlers can follow various traversal strategies, such as depth-first or breadth-first, as they explore the web. The chosen strategy determines the order in which pages are crawled.

7) Politeness: To avoid overloading web servers with too many requests, crawlers may introduce a crawl delay between successive requests.

8) Retrieving Web Pages: Crawlers download the content of web pages, including HTML, text, images, and other resources. The downloaded content is then processed for indexing.

Indexing Process -
1) HTML Parsing: The content retrieved by the crawler is parsed to
extract relevant information. This involves analysing the HTML
structure to identify text, metadata, and other elements on the page.

2) Isolating Textual Content: From the parsed content, the crawler isolates the textual information, such as the body of the page, headings, and other relevant textual data.

3) Breaking into Tokens: The textual content is tokenized, breaking it down into smaller units, typically words or phrases. Tokenization facilitates efficient indexing and retrieval based on keywords.

4) Data Organization: The extracted information is organized into an index, which is a structured database allowing for fast and efficient retrieval. The index includes details about keywords, their locations, and other relevant metadata.

5) Optimising for Retrieval: The index is often organised as an inverted index, mapping each term to the documents or web pages where it appears. This structure enables quick retrieval of documents containing specific terms.

6) Additional Information: In addition to textual content, metadata such as the page title, URL, and other relevant details may be included in the index to enhance the search experience.

7) Real-time Changes: Search engines continuously update their indexes to reflect changes on the web. This ensures that the search results remain current and accurate.
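The practical below concentrates on the crawling side. As a sketch of the indexing side of part A, the text of each fetched page could be tokenized and added to a small inverted index, in the spirit of Practical 1. The helper below is an illustration only, and the example URL is an assumption:

import requests
from bs4 import BeautifulSoup

def index_page(url, inverted_index):
    # Fetch one page, extract its visible text, and record each distinct token against the URL
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, 'html.parser').get_text(separator=' ')
    for token in set(text.lower().split()):
        inverted_index.setdefault(token, set()).add(url)

index = {}
index_page('https://example.com', index)   # assumed example URL
print(len(index), "distinct terms indexed")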

Practical

Input:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.RequestException as err:
        print(f"Request Error: {err}")
    return None


def save_robots_txt(url):
    try:
        robots_url = urljoin(url, '/robots.txt')
        robots_content = get_html(robots_url)
        if robots_content:
            with open('robots.txt', 'wb') as file:
                file.write(robots_content.encode('utf-8-sig'))
    except Exception as e:
        print(f"Error saving robots.txt: {e}")

def load_robots_txt():
    try:
        with open('robots.txt', 'rb') as file:
            return file.read().decode('utf-8-sig')
    except FileNotFoundError:
        return None

def extract_links(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for link in soup.find_all('a', href=True):
        absolute_url = urljoin(base_url, link['href'])
        links.append(absolute_url)
    return links

def is_allowed_by_robots(url, robots_content):
    parser = RobotFileParser()
    parser.parse(robots_content.split('\n'))
    return parser.can_fetch('*', url)

def crawl(start_url, max_depth=3, delay=1):


    visited_urls = set()

    def recursive_crawl(url, depth, robots_content):
        if depth > max_depth or url in visited_urls or not is_allowed_by_robots(url, robots_content):
            return
        visited_urls.add(url)
        time.sleep(delay)

        html = get_html(url)
        if html:
            print(f"Crawling {url}")
            links = extract_links(html, url)

            for link in links:
                recursive_crawl(link, depth + 1, robots_content)

    save_robots_txt(start_url)
    robots_content = load_robots_txt()
    if not robots_content:
        print("Unable to retrieve robots.txt. Crawling without restrictions.")
        robots_content = ""  # treat a missing robots.txt as imposing no restrictions

    recursive_crawl(start_url, 1, robots_content)

# Example usage:
print("Performed by 740_Pallavi & 743_Deepak")
crawl('https://wikipedia.com', max_depth=2, delay=2)


Output:

robots.txt file:


Practical No: 8

Aim: Link Analysis and PageRank


A) Implement the PageRank algorithm to rank web pages based on
link analysis.
B) Apply the PageRank algorithm to a small web graph and analyse
the results.

Theory

Link Analysis:


● Link analysis is a method used to examine relationships and connections between entities in a network.
● It involves studying the links or connections between different elements to uncover patterns, structures, and insights.
● Link analysis is commonly applied in various fields, including information retrieval, social network analysis, fraud detection, and recommendation systems.

PageRank Algorithm -
● The PageRank algorithm is an algorithm used by the Google
search engine to rank web pages in its search results.
● It was developed by Larry Page and Sergey Brin, the co-founders of
Google, and is named after Larry Page.
● PageRank is based on the idea that the importance of a webpage is
determined by the number and quality of other pages that link to
it.

Working -

1) Initialize PageRank Values:


Set an initial PageRank value for each node. Commonly, this is initialised
to 1 divided by the total number of nodes, making the sum of all
PageRank values equal to 1.

2) Define Damping Factor and Iterations:
Choose a damping factor (typically 0.85) to model the probability that a user will continue navigating through the web by following links. Decide on the maximum number of iterations for the algorithm.

3) Iteratively Update PageRank Values:
In each iteration, recompute the PageRank of every node from the ranks of the nodes that link to it:
PR(p) = (1 - d) / N + d * sum( PR(q) / L(q) ) over all pages q that link to p,
where d is the damping factor, N is the total number of nodes, and L(q) is the number of outgoing links of page q.

4) Convergence Check:
After each iteration, check for convergence. If the difference between the new
and previous PageRank values falls below a certain threshold, the algorithm has
converged, and you can stop iterating.

5) Repeat Iterations:
Continue iterating until the maximum number of iterations is reached or until
convergence is achieved.

6) Final PageRank Values:


The final PageRank values represent the importance of each node in the graph
based on the link structure.

Practical

Input:
import numpy as np

def page_rank(graph, damping_factor=0.85, max_iterations=100, tolerance=1e-6):
    # Get the number of nodes
    num_nodes = len(graph)

    # Initialize PageRank values
    page_ranks = np.ones(num_nodes) / num_nodes

    # Iterative PageRank calculation
    for _ in range(max_iterations):
        prev_page_ranks = np.copy(page_ranks)

        for node in range(num_nodes):
            # Calculate the contribution from incoming links
            incoming_links = [i for i, v in enumerate(graph) if node in v]
            if not incoming_links:
                continue

            page_ranks[node] = (1 - damping_factor) / num_nodes + \
                damping_factor * sum(prev_page_ranks[link] / len(graph[link])
                                     for link in incoming_links)

        # Check for convergence
        if np.linalg.norm(page_ranks - prev_page_ranks, 2) < tolerance:
            break

    return page_ranks

# Example usage
if __name__ == "__main__":
    # Define a simple directed graph as an adjacency list
    # Each index represents a node, and the list at that index contains the nodes
    # to which it has outgoing links
    web_graph = [
        [1, 2],  # Node 0 has links to Node 1 and Node 2
        [0, 2],  # Node 1 has links to Node 0 and Node 2
        [0, 1],  # Node 2 has links to Node 0 and Node 1
        [1, 2],  # Node 3 has links to Node 1 and Node 2
    ]

    # Calculate PageRank
    result = page_rank(web_graph)

    # Display PageRank values
    for i, pr in enumerate(result):
        print(f"Page {i}: {pr}")


Output:

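As an optional cross-check (not part of the original journal), the same graph can be ranked with the networkx library. The values will not match the simple implementation above exactly, because networkx keeps the ranks normalised to sum to 1 and updates every node, including those with no incoming links:

import networkx as nx

# The same directed graph as above, written as explicit edges
G = nx.DiGraph()
G.add_edges_from([(0, 1), (0, 2),
                  (1, 0), (1, 2),
                  (2, 0), (2, 1),
                  (3, 1), (3, 2)])

ranks = nx.pagerank(G, alpha=0.85)   # alpha is the damping factor
for node, pr in sorted(ranks.items()):
    print(f"Page {node}: {pr:.4f}")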