Elementary IR: Scalable Boolean Text Search: (Compare With R & G 27.1-3)

This document discusses elementary information retrieval and compares it to relational database management systems. It describes how early IR systems used inverted indexes to perform boolean keyword searches on documents. While IR and DBMSs were originally separate, under the hood they are not as different - both use indexes and logical/physical schemas. The document outlines how a simple relational inverted index can be built to support boolean keyword searching using set operations like union and intersection. It also discusses some ways IR has advanced, like handling phrases and proximity searches.

Elementary IR:

Scalable Boolean
Text Search
(Compare with R & G 27.1-3)

Information Retrieval: History

A research field traditionally separate from
Databases
Hans P. Luhn, IBM, 1959: Keyword in Context (KWIC)
G. Salton at Cornell in the 60s/70s: SMART
Around the same time as relational DB revolution

Tons of research since then

Especially in the web era

Products traditionally separate

Originally, document management systems for
libraries, government, law, etc.
Gained prominence in recent years due to web search
Still used for non-web document management.
(Enterprise search).

Today: Simple (naïve!) IR

Boolean Search on keywords
Goal:
Show that you already have the tools to do this from your
study of relational DBs
We'll skip:
Text-oriented storage formats
Intelligent result ranking (hopefully later!)
Parallelism
Critical for modern relational DBs too

Various bells and whistles (lots of little ones!)

Engineering the specifics of (written) human language

E.g. dealing with tense and plurals
E.g. identifying synonyms and related words
E.g. disambiguating multiple meanings of a word
E.g. clustering output

IR vs. DBMS
Seem like very different beasts

IR                                    DBMS
Imprecise semantics                   Precise semantics
Keyword search                        SQL
Unstructured data format              Structured data
Read-mostly; add docs occasionally    Expect reasonable number of updates
Page through top k results            Generate full answer

Under the hood, not as different as they might seem

But in practice, you have to choose between the 2 today

IR's "Bag of Words" Model

Typical IR data model:
Each document is just a bag of words (terms)
Detail 1: Stop Words
Certain words are not helpful, so not placed in the bag
e.g. real words like "the"
e.g. HTML tags like <H1>
Detail 2: Stemming
Using language-specific rules, convert words to basic
form
e.g. surfing, surfed --> surf
Unfortunately have to do this for each language
Yuck!
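The bag-of-words preparation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and toy_stem is a made-up suffix stripper for English, not a real stemming algorithm.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; real systems use larger, language-specific lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def strip_tags(text):
    """Drop HTML tags such as <H1> so they never reach the bag."""
    return re.sub(r"<[^>]+>", " ", text)

def toy_stem(word):
    """Toy English stemmer (an assumption for illustration): strip a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Return the bag (term -> count) of stemmed, non-stop-word terms."""
    words = re.findall(r"[a-z0-9]+", strip_tags(text).lower())
    return Counter(toy_stem(w) for w in words if w not in STOP_WORDS)
```

For example, bag_of_words("<H1>Surfing the web</H1> surfed") counts "surf" twice and "web" once. Query terms would go through the same function so that they match the stored terms.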

Boolean Text Search

Find all documents that match a Boolean
containment expression:
Windows
AND (Glass OR Door)
AND NOT Microsoft
Note: query terms are also filtered via
stemming and stop words
When web search engines say "10,000
documents found", that's the Boolean
search result size
More or less ;-)

Text Indexes
When IR folks say "text index",
they usually mean more than what DB people
mean
In our terms, both tables and indexes
Really a logical schema (i.e. tables)
With a physical schema (i.e. indexes)
Usually not stored in a DBMS
Tables implemented as files in a file system

A Simple Relational Text Index

Given: a corpus of text files
Files(docID string, content string)
Create and populate a bag of words table
InvertedFile(term string, docID string)
Build a B+-tree or Hash index on InvertedFile.term
Something like "Alternative 3" (one index entry per key, holding a list of matching record IDs) is critical here!
Keep lists of dup keys sorted by docID
Will provide interesting orders later on!

Fancy list compression on the docIDs is important, too

Typically called a postings list by IR people

Note: URL instead of RID, the web is your heap file!

Can also cache pages and use RIDs

This is often called an inverted file or inverted index

Maps from words -> docs, rather than docs -> words
Given this, you can now do single-word text search queries!
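As a rough sketch of the inverted file described above, the following Python builds a term -> postings list mapping with each list sorted by docID. A plain dict stands in for the B+-tree or hash index on InvertedFile.term; the function names are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

def build_inverted_file(files):
    """files: dict docID -> iterable of terms (the bag of words for that doc).

    Returns dict term -> list of docIDs sorted ascending (the postings list).
    """
    postings = defaultdict(set)
    for doc_id, terms in files.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sorting each postings list by docID is what makes merge-style set operations cheap.
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}

def search(index, term):
    """Single-word search: look up the term and return its postings list."""
    return index.get(term, [])
```

A single-word query is then just one index lookup, e.g. search(index, "database").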

An Inverted File
Snippets from an old class web page and an old microsoft.com home page.
Search for "databases", "microsoft".

Term          docID
data          http://www-inst.eecs.berkeley.edu/~cs186
database      http://www-inst.eecs.berkeley.edu/~cs186
date          http://www-inst.eecs.berkeley.edu/~cs186
day           http://www-inst.eecs.berkeley.edu/~cs186
dbms          http://www-inst.eecs.berkeley.edu/~cs186
decision      http://www-inst.eecs.berkeley.edu/~cs186
demonstrate   http://www-inst.eecs.berkeley.edu/~cs186
description   http://www-inst.eecs.berkeley.edu/~cs186
design        http://www-inst.eecs.berkeley.edu/~cs186
desire        http://www-inst.eecs.berkeley.edu/~cs186
developer     http://www.microsoft.com
differ        http://www-inst.eecs.berkeley.edu/~cs186
disability    http://www.microsoft.com
discussion    http://www-inst.eecs.berkeley.edu/~cs186
division      http://www-inst.eecs.berkeley.edu/~cs186
do            http://www-inst.eecs.berkeley.edu/~cs186
document      http://www-inst.eecs.berkeley.edu/~cs186
document      http://www.microsoft.com
...
microsoft     http://www.microsoft.com
microsoft     http://www-inst.eecs.berkeley.edu/~cs186
midnight      http://www-inst.eecs.berkeley.edu/~cs186
midterm       http://www-inst.eecs.berkeley.edu/~cs186
minibase      http://www-inst.eecs.berkeley.edu/~cs186
million       http://www.microsoft.com
monday        http://www.microsoft.com
more          http://www.microsoft.com
most          http://www-inst.eecs.berkeley.edu/~cs186
ms            http://www-inst.eecs.berkeley.edu/~cs186
msn           http://www.microsoft.com
must          http://www-inst.eecs.berkeley.edu/~cs186
necessary     http://www-inst.eecs.berkeley.edu/~cs186
need          http://www-inst.eecs.berkeley.edu/~cs186

Handling Boolean Logic

How to do term1 OR term2?
Union of two postings lists (docID sets)!
How to do term1 AND term2?
Intersection of two postings lists!
Can be done via merge of postings lists
Remember: postings list per key sorted by docID in index

How to do term1 AND NOT term2?

Set subtraction
Also easy because sorted (basically merge logic again)

How to do "term1 OR NOT term2"?

Union of term1 with "NOT term2".
"NOT term2" = all docs not containing term2. Yuck!

Usually not allowed!

Optimizations: What order to handle terms if you have many
ANDs? Can you do better than merge? How does this interact
with postings list compression?

Windows AND (Glass OR Door)
AND NOT Microsoft
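The merge logic above can be sketched as follows, assuming each postings list is a Python list of docIDs sorted ascending with no duplicates. The function names are illustrative. intersect_many shows one common answer to the ordering question: start with the shortest list so intermediate results stay small.

```python
def intersect(a, b):
    """AND: docIDs present in both sorted lists, in one merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: docIDs present in either sorted list, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def difference(a, b):
    """AND NOT: docIDs in a that do not appear in b."""
    out, i, j = [], 0, 0
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out

def intersect_many(lists):
    """Many ANDs: process the shortest lists first (needs at least one list)."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect(result, other)
    return result
```

The example query then becomes difference(intersect(windows, union(glass, door)), microsoft), where each name is the postings list of the corresponding term.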

Boolean Search in SQL

(SELECT docID FROM InvertedFile
 WHERE term = 'window'
 INTERSECT
 SELECT docID FROM InvertedFile
 WHERE term = 'glass' OR term = 'door')
EXCEPT
SELECT docID FROM InvertedFile
WHERE term = 'microsoft'
ORDER BY magic_rank()
There's only one SQL query template in Boolean Search
Single-table selects, UNION, INTERSECT, EXCEPT
magic_rank() is the "secret sauce" in the search engines
We'll study this later in the semester
Combos of statistics, linguistics, and graph theory tricks!
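To try the query template end to end, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory InvertedFile(term, docID) table. Two things are assumptions for illustration: ORDER BY docID stands in for magic_rank(), which is not a real function, and the parentheses are dropped because SQLite groups compound operators left to right, which gives the same (A INTERSECT B) EXCEPT C grouping.

```python
import sqlite3

def boolean_search(rows):
    """rows: iterable of (term, docID) pairs. Returns matching docIDs in docID order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE InvertedFile (term TEXT, docID TEXT)")
    # Index on (term, docID): entries for one term come back sorted by docID.
    conn.execute("CREATE INDEX InvertedFile_term ON InvertedFile (term, docID)")
    conn.executemany("INSERT INTO InvertedFile VALUES (?, ?)", rows)
    query = """
        SELECT docID FROM InvertedFile WHERE term = 'window'
        INTERSECT
        SELECT docID FROM InvertedFile WHERE term IN ('glass', 'door')
        EXCEPT
        SELECT docID FROM InvertedFile WHERE term = 'microsoft'
        ORDER BY docID
    """
    result = [doc_id for (doc_id,) in conn.execute(query)]
    conn.close()
    return result
```

Only the three literals change from query to query, which is why a search engine can afford to tune this one template so heavily.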

One step fancier: Phrases and "Near"

Suppose you want a phrase
E.g. "Happy Days"
Different schema:
InvertedFile (term string, docID string, position int)
Index on term (sort of Alternative 3 style, with docID and
position in the postings list)
Postings lists sorted by (docID, position)
Post-process the results
Find "Happy" AND "Days"
Keep results where the position of "Days" is exactly 1 past the position of "Happy"
Can be done during merge-join to AND the 2 lists!

Can do a similar thing for term1 NEAR term2

Positions differ by less than k
Think about refinement to merge-join
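A small sketch of phrase and NEAR matching over positional postings lists, assuming each list holds (docID, position) pairs sorted by (docID, position). The helper names are illustrative. The outer loop is the usual merge on docID; the inner position check is deliberately naive (all pairs) and is where a refined merge-join would do better.

```python
from itertools import groupby
from operator import itemgetter

def positions_by_doc(postings):
    """Group a (docID, position)-sorted postings list into (docID, [positions]) pairs."""
    return [(doc, [pos for _, pos in group])
            for doc, group in groupby(postings, key=itemgetter(0))]

def docs_where(first, second, close_enough):
    """Merge two positional lists on docID; keep docs with a qualifying position pair."""
    a, b = positions_by_doc(first), positions_by_doc(second)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            if any(close_enough(p, q) for p in a[i][1] for q in b[j][1]):
                out.append(a[i][0])
            i += 1
            j += 1
    return out

def phrase(first, second):
    """Two-word phrase: second term exactly one position after the first."""
    return docs_where(first, second, lambda p, q: q - p == 1)

def near(first, second, k):
    """term1 NEAR term2: positions differ by less than k, in either order."""
    return docs_where(first, second, lambda p, q: abs(q - p) < k)
```

With happy = [("d1", 4), ("d2", 9)] and days = [("d1", 7), ("d2", 10)], phrase(happy, days) keeps only "d2", while near(happy, days, 4) keeps both documents.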

For better compression

InvertedFile (term string, position int, docID int)
IDs smaller, compress better than URLs

Files(docID int, URL string, snippet string, ...)

Btree on InvertedFile.term
Btree on Files.docID
Requires a final join step between typical
query result and Files.docID
Can do this lazily: cursor to generate a page full of
results
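The lazy final join can be sketched with a Python generator that looks up Files rows only when a page of results is actually requested. Here a dict keyed by integer docID stands in for the B-tree on the Files table, and result_pages and page_size are illustrative names, not part of the slides.

```python
from itertools import islice

def result_pages(doc_ids, files, page_size=10):
    """Yield one page of Files rows at a time for an iterable of integer docIDs.

    files: dict docID -> row (e.g. the document's string identifier and snippet).
    Lookups for a page happen only when the caller asks for that page.
    """
    it = iter(doc_ids)
    while True:
        page_ids = list(islice(it, page_size))
        if not page_ids:
            return
        yield [files[doc_id] for doc_id in page_ids]
```

A caller that shows only the first page calls next() once and never pays for the remaining lookups.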

Updates and Text Search

Text search engines are designed to be query-mostly
Deletes and modifications are rare
Can postpone updates (nobody notices, no transactions!)
Can work off a union of indexes
Merge them in batch (typically re-bulk-load a new index)

Can't afford to go offline for an update?

Create a 2nd index on a separate machine
Replace the 1st index with the 2nd!

So no concurrency control problems

Can compress to search-friendly, update-unfriendly format
Can keep postings lists sorted
For these reasons, text search engines and DBMSs are usually
separate products
Also, text-search engines tune that one SQL query to death!
The benefits of a special-case workload.
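One way to picture "work off a union of indexes, merge them in batch" is the sketch below: a large read-only main index plus a small delta index for recent additions. The class and method names are assumptions for illustration, and deletes are ignored.

```python
class BatchedIndex:
    """Queries read main + delta; merge() rebuilds main in one batch and swaps it in."""

    def __init__(self):
        self.main = {}    # term -> sorted list of docIDs (bulk-loaded, never edited in place)
        self.delta = {}   # term -> set of docIDs added since the last merge

    def add(self, doc_id, terms):
        """Postponed insert: only the small delta index is touched."""
        for term in terms:
            self.delta.setdefault(term, set()).add(doc_id)

    def search(self, term):
        """Answer from the union of both indexes."""
        merged = set(self.main.get(term, [])) | self.delta.get(term, set())
        return sorted(merged)

    def merge(self):
        """Batch step: build a fresh main index on the side, then replace the old one."""
        new_main = {term: list(ids) for term, ids in self.main.items()}
        for term, ids in self.delta.items():
            new_main[term] = sorted(set(new_main.get(term, [])) | ids)
        self.main, self.delta = new_main, {}
```

Because merge() builds the new index on the side and swaps it in with a single assignment, queries never see a half-updated postings list.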

Lots more tricks in IR

How to rank the output?
A mix of simple tricks works well
Some fancier tricks can help (use hyperlink graph)
Other ways to help users paw through the output?
Document clustering (e.g. Clusty.com)
Document visualization
How to use compression for better I/O performance?
E.g. making postings lists smaller
Try to make things fit in RAM
Or in processor caches
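As one illustration of making postings lists smaller, the sketch below stores the gaps between consecutive integer docIDs and writes each gap as a variable-length byte sequence. This is a generic gap plus variable-byte scheme chosen for the example, not a claim about what any particular engine uses; it assumes docIDs are non-negative integers sorted ascending.

```python
def encode_postings(doc_ids):
    """Gap-encode a sorted list of int docIDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)                      # final byte, continuation bit clear
    return bytes(out)

def decode_postings(data):
    """Inverse of encode_postings: rebuild the sorted docID list."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids
```

Since sorted lists have small gaps, most entries take a single byte; the four docIDs 3, 130, 131, 100000 encode to 6 bytes in total.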

How to deal with synonyms, misspelling, abbreviations?

How to write a good web crawler?
We'll return to some of these later
The book "Managing Gigabytes" covers some of the details

Recall From the First Lecture

DBMS (layers, top to bottom):
Query Optimization and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
(Concurrency and Recovery needed)

Search Engine (layers, top to bottom):
Search String Modifier
Ranking Algorithm
"The Query" and "The Access Method" (a simple DBMS)
Buffer Management and Disk Space Management (OS)

You Know The Basics!

Inverted files are the workhorses of all
text search engines
Just B+-tree or Hash indexes on bag-of-words
Intersect, Union and Set Difference (Except)
Usually implemented via sorting
Or can be done with hash or index joins
Most of the other stuff is not systems
work
A lot of it is cleverness in dealing with language
Both linguistics and statistics (more the latter!)

Revisiting Our IR/DBMS Distinctions

Semantic Guarantees on Storage
DBMS guarantees transactional semantics
If an inserting transaction commits, a subsequent query will see the update
Handles multiple concurrent updates correctly

IR systems do not do this; nobody notices!

Postpone insertions until convenient
No model of correct concurrency.
Can even return incorrect answers for various reasons!

Data Modeling & Query Complexity

DBMS supports any schema & queries
But requires you to define schema
And SQL is hard to figure out for the average citizen

IR supports only one schema & query

No schema design required (unstructured text)
Trivial (natural?) query language for simple tasks
No data correlation or analysis capabilities -- search only

Revisiting Distinctions, Cont.

Performance goals
DBMS supports general SELECT
plus mix of INSERT, UPDATE, DELETE
general purpose engine must always perform well

IR systems expect only one stylized SELECT

plus delayed INSERT, unusual DELETE, no UPDATE.
special purpose, must run super-fast on The Query
users rarely look at the full answer in Boolean
Search
Postpone any work you can to subsequent index joins
But make sure you can rank!

Summary
IR & Relational systems share basic building blocks for
scalability
IR internal representation is relational!
Equality indexes (B-trees)
Iterators
Join algorithms, esp. merge-join
Join ordering and selectivity estimation
IR constrains queries, schema, promises on semantics
Affects storage format, indexing and concurrency control
Affects join algorithms & selectivity estimation
IR has different performance goals
Ranking and best answers fast
Many challenges in IR related to text engineering
But these don't tend to change the scalability infrastructure

IR Buzzwords to Know (so far!)

Learning this in the context of relational
foundations is fine, but you need to know
the IR lingo!
Corpus: a collection of documents
Term: an isolated string (searchable unit)
Index: a mechanism mapping terms to
documents
Inverted File (= Postings File): a file
containing terms and associated postings lists
Postings List: a list of pointers (postings) to
documents

Exercise!
Implement Boolean search, as described, in Postgres
Using the schemas and indexes here.
Write a simple script to load files.
You can ignore stemming and stop-words.

Run the SQL versions of Boolean queries

Measure how slow search is

Identify contributing factors in performance

E.g. how much disk space does this version use (including indexes) vs. the
raw documents vs. the documents gzipped?
E.g. is PG identifying the interesting orders in the postings lists? (use
EXPLAIN) If not, can you force it to do so?

Compare to Postgres tsearch facility

Two index choices, GIN and GiST. GIN is an inverted index.
Use the cost models for IndexScan and MergeJoin to calculate the
expected number of I/Os. Distinguish sequential and random I/Os.
Why is the naïve solution slow? Storage overhead? Optimizer smarts?
