ir-journal
Practical No: 1
Aim: Document Indexing and Retrieval
● Implement an inverted index construction algorithm.
● Build a simple document retrieval system using the
constructed index.
Theory:
● An Inverted Index is a data structure used in information retrieval
systems to efficiently retrieve documents or web pages containing
a specific term or set of terms.
● In an inverted index, the index is organised by terms (words), and
each term points to a list of documents or web pages that contain that
term.
● Inverted indexes are widely used in search engines, database systems,
and other applications where efficient text search is required.
● They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow. An
inverted index is an index data structure storing a mapping from
content, such as words or numbers, to its locations in a document or a
set of documents.
Practical:
Input:
import nltk # Import NLTK to download stopwords
from nltk.corpus import stopwords # Import stopwords from NLTK
inverted_index[term] = documents  # the term points to the set of documents that contain it
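A minimal end-to-end sketch of inverted index construction and lookup, consistent with the imports above; the sample corpus, function names, and query are illustrative assumptions:
nltk.download('stopwords', quiet=True)              # make sure the stopword list is available
stop_words = set(stopwords.words('english'))

documents = {                                       # illustrative sample corpus
    1: "Information retrieval finds relevant documents",
    2: "An inverted index maps terms to documents",
    3: "Search engines use an inverted index for fast retrieval",
}

def build_inverted_index(docs):
    inverted_index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term in stop_words:                  # skip common stopwords
                continue
            docs_for_term = inverted_index.get(term, set())
            docs_for_term.add(doc_id)
            inverted_index[term] = docs_for_term    # the term points to the documents containing it
    return inverted_index

def retrieve(query, inverted_index):
    # Return the IDs of documents containing every (non-stopword) query term
    terms = [t for t in query.lower().split() if t not in stop_words]
    if not terms:
        return set()
    result = set(inverted_index.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted_index.get(term, set())
    return result

index = build_inverted_index(documents)
print("Documents matching 'inverted index':", retrieve("inverted index", index))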
Practical No: 2
Aim: Retrieval Models
● Implement the Boolean retrieval model and process queries.
● Implement the vector space model with TF-IDF weighting and
cosine similarity.
Theory:
A) Boolean Retrieval Model -
● A Boolean model is a fundamental concept in Information Retrieval (IR) that is used to represent and retrieve documents or information based on Boolean logic.
Boolean Operators:
● AND: the query "cats AND dogs" retrieves only documents that contain both "cats" and "dogs".
● OR: the query "cats OR dogs" retrieves documents that contain "cats", "dogs", or both.
● NOT: the query "cats NOT dogs" retrieves documents that contain "cats" but not "dogs".
B) TF-IDF
● Term Frequency - Inverse Document Frequency (TF-IDF) is
a widely used statistical method in information retrieval.
● It measures how important a term is within a
document relative to a collection of documents.
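One common formulation of the weighting (several variants exist) is:
TF(t, d) = (number of times term t occurs in document d) / (total number of terms in d)
IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t
TF-IDF(t, d) = TF(t, d) * IDF(t)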
C) Cosine Similarity -
● Cosine similarity measures how similar two documents are by taking the cosine of the angle between their term-weight (e.g. TF-IDF) vectors: cos(d1, d2) = (d1 · d2) / (|d1| * |d2|).
● A value of 1 means the vectors point in the same direction (highly similar documents), while a value of 0 means the documents share no terms.
Practical:
documents = {
1: "apple banana orange",
2: "apple banana",
3: "banana orange",
4: "apple"
}
def boolean_and(operands, index):  # assumed function name and signature
    result = index.get(operands[0], set())  # Get the set of document IDs for the first operand
    for term in operands[1:]:  # Iterate through the rest of the operands
        result = result.intersection(index.get(term, set()))  # Compute intersection with sets of document IDs
    return list(result)  # Return the resulting list of document IDs
# Example queries
query1 = ["apple", "banana"]  # Query for documents containing both "apple" and "banana"
query2 = ["apple", "orange"]  # Query for documents containing "apple" or "orange"
# Printing results
print("Documents containing 'apple' and 'banana':", result1)
print("Documents containing 'apple' or 'orange':", result2)
print("Documents not containing 'orange':", result3)
print("Performed by 740_Pallavi & 743_Deepak")
Output:
# Fit the transformer to the test set and transform it to TF-IDF representation
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())
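A self-contained sketch of the vector space model with TF-IDF weighting and cosine similarity using scikit-learn; the corpus and query here are illustrative:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["apple banana orange", "apple banana", "banana orange", "apple"]
query = ["apple orange"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)   # TF-IDF vectors for the documents
query_vector = vectorizer.transform(query)       # TF-IDF vector for the query

scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc_id, score in enumerate(scores, start=1):
    print(f"Document {doc_id}: cosine similarity = {score:.3f}")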
Output:
Practical No: 3
Theory:
Edit Distance:
● The edit distance (Levenshtein distance) between two strings is the minimum number of single-character operations (insertions, deletions, and substitutions) required to transform one string into the other. For example, converting "sunday" into "saturday" requires three operations.
Practical:
Input:
# A naive recursive Python program to find the minimum number of
# operations to convert str1 to str2
def editDistance(str1, str2, m, n):
    # If first string is empty, the only option is to insert all characters of second string into first
    if m == 0:
        return n
    # If second string is empty, the only option is to remove all characters of first string
    if n == 0:
        return m
    # If last characters of two strings are same, nothing much to do. Ignore last characters and get count for remaining strings.
    if str1[m-1] == str2[n-1]:
        return editDistance(str1, str2, m-1, n-1)
    # If last characters are not same, consider all three operations on last character of first string,
    # recursively compute minimum cost for all three operations and take minimum of three values.
    return 1 + min(editDistance(str1, str2, m, n-1),    # Insert
                   editDistance(str1, str2, m-1, n),    # Remove
                   editDistance(str1, str2, m-1, n-1))  # Replace
# Driver code
str1 = "sunday"
str2 = "saturday"
print('Edit Distance is: ', editDistance(str1, str2, len(str1), len(str2)))
Output:
Practical No: 4
Theory:
1. Precision:
● Precision is the ratio of correctly predicted positive observations
to the total predicted positives.
● It is also called Positive Predictive Value (PPV).
● Precision is calculated using the following formula: Precision = TP / (TP + FP), where FP (False Positives) is the number of instances incorrectly predicted as positive.
2. Recall:
• Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
• Recall is calculated using the following formula: Recall = TP / (TP + FN)
Where:
• TP (True Positives) is the number of instances correctly predicted as positive.
• FN (False Negatives) is the number of instances incorrectly predicted as negative.
High recall indicates that the model has a low rate of false negatives. In other words, the model is effective at capturing all the positive instances.
3. F-measure:
• The F-measure is a metric commonly used in performance evaluation.
• It combines precision and recall into a single value, providing a balanced measure of a model's performance.
• It is the harmonic mean of the two: F-measure = 2 * (Precision * Recall) / (Precision + Recall).
4. Average Precision:
Average Precision is used to find the average of the model's precision, based on the precision values computed at each rank at which a relevant document is retrieved.
Input:
'''
(Optional)
PPT values:
true_positive = 20
false_positive = 10
false_negative = 30
'''
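A minimal sketch of the computation that feeds the print statements below, using the illustrative counts from the comment above:
true_positive = 20
false_positive = 10
false_negative = 30

precision = true_positive / (true_positive + false_positive)   # TP / (TP + FP)
recall = true_positive / (true_positive + false_negative)      # TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)      # harmonic mean of precision and recall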
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-measure:
{f_measure}")
Output:
Input:
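A minimal sketch of computing Average Precision over a ranked result list, assuming a binary relevance judgment for each rank (the data and names are illustrative):
def average_precision(relevance):
    # relevance: 1 if the document at that rank is relevant, else 0
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

ranked_relevance = [1, 0, 1, 1, 0]   # illustrative relevance judgments for 5 ranked results
print("Average Precision:", average_precision(ranked_relevance))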
Practical No: 5
Theory:
Naive Bayes
• The Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-
dimensional training dataset.
• The Naive Bayes classifier is one of the simplest and most effective classification algorithms and helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Bayes' Theorem:
P(A|B) = (P(B|A) * P(A)) / P(B)
Practical:
Input:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
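A minimal end-to-end sketch consistent with these imports; the toy dataset, column names, and split parameters are illustrative assumptions:
# Illustrative toy dataset: short texts labelled spam / ham
data = pd.DataFrame({
    'text': ["win a free prize now", "meeting at noon tomorrow",
             "free lottery ticket offer", "project report attached",
             "claim your free reward", "lunch with the team today"],
    'label': ["spam", "ham", "spam", "ham", "spam", "ham"]
})

X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.33, random_state=42)

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)   # term-count features for training texts
X_test_counts = vectorizer.transform(X_test)         # same vocabulary applied to test texts

model = MultinomialNB()
model.fit(X_train_counts, y_train)
predictions = model.predict(X_test_counts)

print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))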
Output:
Practical No: 6
Theory:
K-Means Clustering:
● K-Means Clustering is an Unsupervised Learning algorithm,
which groups the unlabeled dataset into different clusters.
● Here K defines the number of predefined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for
K=3, there will be three clusters, and so on.
● It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
● The main aim of this algorithm is to minimise the sum of
distances between the data point and their corresponding clusters.
● The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
Step-5: Repeat the third step, which means reassign each datapoint to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
Practical
Input:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
documents = ["Cats are known for their agility and grace", #cat doc1
"Dogs are often called ‘man’s best friend’.", #dog doc1
"Some dogs are trained to assist people with disabilities.", #dog doc2
"The sun rises in the east and sets in the west.", #sun doc1
"Many cats enjoy climbing trees and chasing toys.", #cat doc2
]
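A minimal sketch of the clustering step that follows these imports and documents; the number of clusters and random_state are assumed values:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)        # TF-IDF document-term matrix

k = 3                                          # assumed number of clusters (cats, dogs, sun)
model = KMeans(n_clusters=k, random_state=42, n_init=10)
model.fit(X)

for doc, label in zip(documents, model.labels_):
    print(f"Cluster {label}: {doc}")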
Output:
Practical No: 7
Theory
Crawling: Google downloads text, images, and videos from pages it found on
the internet with automated programs called crawlers.
Indexing: Google analyses the text, images, and video files on the page, and
stores the information in the Google index, which is a large database.
Crawling Process -
1) Starting Point: The crawling process usually begins with a set of seed
URLs, which can be provided manually or generated through
algorithms. These URLs serve as the starting points for the web
crawlers.
Indexing Process -
1) HTML Parsing: The content retrieved by the crawler is parsed to
extract relevant information. This involves analysing the HTML
structure to identify text, metadata, and other elements on the page.
Practical
Input:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.RequestException as err:
        print(f"Request Error: {err}")
    return None
def save_robots_txt(url):
    try:
        robots_url = urljoin(url, '/robots.txt')
        robots_content = get_html(robots_url)
        if robots_content:
            with open('robots.txt', 'wb') as file:
                file.write(robots_content.encode('utf-8-sig'))
    except Exception as e:
        print(f"Error saving robots.txt: {e}")

def load_robots_txt():
    try:
        with open('robots.txt', 'rb') as file:
            return file.read().decode('utf-8-sig')
    except FileNotFoundError:
        return None
visited_urls = set()

def recursive_crawl(url, depth, robots_content, max_depth=2, delay=2):
    # Assumed function name, signature and control flow around the surviving statements
    if depth > max_depth or url in visited_urls:
        return
    visited_urls.add(url)
    html = get_html(url)
    if html:
        print(f"Crawling {url}")
        links = extract_links(html, url)   # helper sketched below
        for link in links:
            time.sleep(delay)
            recursive_crawl(link, depth + 1, robots_content, max_depth, delay)

def crawl(start_url, max_depth=2, delay=2):
    # Assumed wrapper: save and load robots.txt, then start the recursive crawl
    save_robots_txt(start_url)
    robots_content = load_robots_txt()
    if not robots_content:
        print("Unable to retrieve robots.txt. Crawling without restrictions.")
    recursive_crawl(start_url, 1, robots_content, max_depth, delay)
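A minimal sketch of the extract_links helper assumed above; it simply collects absolute http/https links from the page's anchor tags:
def extract_links(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for anchor in soup.find_all('a', href=True):
        absolute = urljoin(base_url, anchor['href'])         # resolve relative links against the page URL
        if urlparse(absolute).scheme in ('http', 'https'):   # keep only web links
            links.append(absolute)
    return links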
# Example usage:
print("Performed by 740_Pallavi & 743_Deepak")
crawl('https://wikipedia.com', max_depth=2, delay=2)
Output:
robots.txt file:
Practical No: 8
Theory
Link Analysis:
Link analysis studies the hyperlink structure between web pages and uses the number and pattern of links to estimate the importance or authority of each page; PageRank is the best-known link analysis algorithm.
PageRank Algorithm -
● The PageRank algorithm is an algorithm used by the Google
search engine to rank web pages in its search results.
● It was developed by Larry Page and Sergey Brin, the co-founders of
Google, and is named after Larry Page.
● PageRank is based on the idea that the importance of a webpage is
determined by the number and quality of other pages that link to
it.
Working -
4) Convergence Check:
After each iteration, check for convergence. If the difference between the new
and previous PageRank values falls below a certain threshold, the algorithm has
converged, and you can stop iterating.
5) Repeat Iterations:
Continue iterating until the maximum number of iterations is reached or until
convergence is achieved.
Practical
Input:
import numpy as np
continue
return page_ranks
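A self-contained sketch of a page_rank function consistent with the example usage below; the damping factor, iteration limit, and convergence threshold are assumed values:
def page_rank(graph, damping=0.85, max_iterations=100, tol=1.0e-6):
    n = len(graph)
    page_ranks = np.ones(n) / n                        # start from a uniform distribution
    for _ in range(max_iterations):
        new_ranks = np.full(n, (1 - damping) / n)      # teleportation term
        for node, out_links in enumerate(graph):
            if not out_links:                          # dangling node: spread its rank evenly
                new_ranks += damping * page_ranks[node] / n
                continue
            share = damping * page_ranks[node] / len(out_links)
            for target in out_links:
                new_ranks[target] += share
        if np.abs(new_ranks - page_ranks).sum() < tol: # convergence check (step 4 above)
            page_ranks = new_ranks
            break
        page_ranks = new_ranks
    return page_ranks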
# Example usage
if __name__ == "__main__":
    # Define a simple directed graph as an adjacency list
    # Each index represents a node, and the list at that index contains nodes to which it has outgoing links
    web_graph = [
        [1, 2],  # Node 0 has links to Node 1 and Node 2
        [0, 2],  # Node 1 has links to Node 0 and Node 2
        [0, 1],  # Node 2 has links to Node 0 and Node 1
        [1, 2],  # Node 3 has links to Node 1 and Node 2
    ]
    # Calculate PageRank
    result = page_rank(web_graph)
Output: