Elementary IR: Scalable Boolean Text Search: (Compare With R & G 27.1-3)
IR vs. DBMS
Seem like very different beasts:

  IR                     DBMS
  Imprecise semantics    Precise semantics
  Keyword search         SQL
                         Structured data
Text Indexes
When IR folks say "text index", they usually mean more than what DB people mean
  In our terms, both "tables" and "indexes"
  Really a logical schema (i.e., tables)
  With a physical schema (i.e., indexes)
  Usually not stored in a DBMS
    Tables implemented as files in a file system
An Inverted File
Snippets from: old class web page; old microsoft.com home page
Search for: databases, microsoft

  Term          docID
  data          http://www-inst.eecs.berkeley.edu/~cs186
  database      http://www-inst.eecs.berkeley.edu/~cs186
  date          http://www-inst.eecs.berkeley.edu/~cs186
  day           http://www-inst.eecs.berkeley.edu/~cs186
  dbms          http://www-inst.eecs.berkeley.edu/~cs186
  decision      http://www-inst.eecs.berkeley.edu/~cs186
  demonstrate   http://www-inst.eecs.berkeley.edu/~cs186
  description   http://www-inst.eecs.berkeley.edu/~cs186
  design        http://www-inst.eecs.berkeley.edu/~cs186
  desire        http://www-inst.eecs.berkeley.edu/~cs186
  developer     http://www.microsoft.com
  differ        http://www-inst.eecs.berkeley.edu/~cs186
  disability    http://www.microsoft.com
  discussion    http://www-inst.eecs.berkeley.edu/~cs186
  division      http://www-inst.eecs.berkeley.edu/~cs186
  do            http://www-inst.eecs.berkeley.edu/~cs186
  document      http://www-inst.eecs.berkeley.edu/~cs186
  document      http://www.microsoft.com

  Term          docID
  microsoft     http://www.microsoft.com
  microsoft     http://www-inst.eecs.berkeley.edu/~cs186
  midnight      http://www-inst.eecs.berkeley.edu/~cs186
  midterm       http://www-inst.eecs.berkeley.edu/~cs186
  minibase      http://www-inst.eecs.berkeley.edu/~cs186
  million       http://www.microsoft.com
  monday        http://www.microsoft.com
  more          http://www.microsoft.com
  most          http://www-inst.eecs.berkeley.edu/~cs186
  ms            http://www-inst.eecs.berkeley.edu/~cs186
  msn           http://www.microsoft.com
  must          http://www-inst.eecs.berkeley.edu/~cs186
  necessary     http://www-inst.eecs.berkeley.edu/~cs186
  need          http://www-inst.eecs.berkeley.edu/~cs186
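A minimal sketch of how an inverted file like this one could be built: tokenize each document, emit one (term, docID) row per distinct term, and sort by term. The document texts, the tokenizer regex, and the function names `build_inverted_file` and `lookup` are assumptions made for illustration; they are not from the slides.

```python
import re

# Hypothetical snippets standing in for the two pages; the docIDs are their URLs.
DOCS = {
    "http://www-inst.eecs.berkeley.edu/~cs186": "Database design midterm on Monday",
    "http://www.microsoft.com": "Microsoft developer document",
}

def build_inverted_file(docs):
    """Return the inverted file as a sorted list of distinct (term, docID) rows."""
    rows = set()
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase runs of letters and digits.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            rows.add((term, doc_id))
    return sorted(rows)

def lookup(rows, term):
    """Return the docIDs whose row matches `term` exactly (no stemming)."""
    return [doc_id for t, doc_id in rows if t == term]

if __name__ == "__main__":
    inverted = build_inverted_file(DOCS)
    for term, doc_id in inverted:
        print(f"{term:<12}{doc_id}")
    print(lookup(inverted, "microsoft"))
```

Because the sketch does no stemming, a search for "databases" finds nothing even though "database" is indexed.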
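[Figure: DBMS and search engine architectures, side by side]
DBMS layers: Query Optimization and Execution; Relational Operators; Buffer Management; Disk Space Management; DB. Annotation: Concurrency and Recovery Needed.
Search Engine layers: The Query; Ranking Algorithm; Buffer Management (OS); Disk Space Management; DB. Annotation: Simple DBMS.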
Summary
IR & relational systems share basic building blocks for scalability
  IR internal representation is relational!
  Equality indexes (B-trees)
  Iterators
  Join algorithms, esp. merge-join
  Join ordering and selectivity estimation
IR constrains queries, schema, promises on semantics
  Affects storage format, indexing and concurrency control
  Affects join algorithms & selectivity estimation
IR has different performance goals
  Ranking and best answers fast
Many challenges in IR related to text engineering
  But these don't tend to change the scalability infrastructure
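To make the merge-join point concrete, here is a hedged sketch of Boolean AND as a merge-join over sorted posting lists. It assumes each term maps to an ascending, duplicate-free list of docIDs held in a dict called `postings`. The names `merge_join` and `boolean_and` are hypothetical. Using list length as the selectivity estimate is a crude stand-in for real estimation.

```python
def merge_join(left, right):
    """Intersect two ascending, duplicate-free docID lists in a single pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # left is behind; advance it
        else:
            j += 1  # right is behind; advance it
    return out

def boolean_and(postings, terms):
    """AND together the posting lists of `terms`, shortest list first."""
    # Assumed join ordering: shorter posting list = more selective term.
    lists = sorted((postings.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for nxt in lists[1:]:
        result = merge_join(result, nxt)
    return result

if __name__ == "__main__":
    postings = {"database": [1, 3, 5, 7], "microsoft": [3, 4, 7], "midterm": [7]}
    print(boolean_and(postings, ["database", "microsoft"]))
```

Starting with the shortest list keeps intermediate results small, which is the same reason a relational optimizer cares about join order.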
Exercise!
Implement Boolean search, as described here, in Postgres
Using the schemas and indexes here.
Write a simple script to load files.
You can ignore stemming and stop-words.
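One possible starting point, sketched with Python's built-in `sqlite3` module instead of Postgres so that it runs without a server. The table name `InvertedFile`, the index name, the tokenizer, and the functions `load_docs` and `search_and` are all assumptions, not the course's reference solution. The SQL is meant to carry over to Postgres largely unchanged, though a Postgres driver will likely use a different placeholder style than `?`. Reading document text from files on disk is left out.

```python
import re
import sqlite3

def load_docs(conn, docs):
    """Create the (term, docID) table plus an index on it, then load `docs`."""
    conn.execute(
        "CREATE TABLE InvertedFile (term TEXT NOT NULL, docID TEXT NOT NULL)"
    )
    # Index on (term, docID) for equality lookups; UNIQUE rejects duplicate rows.
    conn.execute(
        "CREATE UNIQUE INDEX InvertedFile_term_doc ON InvertedFile (term, docID)"
    )
    for doc_id, text in docs.items():
        # No stemming or stop-word removal, as the exercise allows.
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        conn.executemany(
            "INSERT INTO InvertedFile (term, docID) VALUES (?, ?)",
            [(t, doc_id) for t in sorted(terms)],
        )
    conn.commit()

def search_and(conn, terms):
    """Return the sorted docIDs that contain every term in `terms`."""
    terms = sorted({t.lower() for t in terms})
    if not terms:
        return []
    one_term = "SELECT docID FROM InvertedFile WHERE term = ?"
    # Boolean AND expressed as an INTERSECT of one lookup per term.
    sql = " INTERSECT ".join([one_term] * len(terms)) + " ORDER BY docID"
    return [row[0] for row in conn.execute(sql, terms)]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_docs(conn, {"d1": "Database systems and Microsoft", "d2": "Microsoft home page"})
    print(search_and(conn, ["database", "microsoft"]))
```

INTERSECT is used here for brevity; a self-join of the table on docID, one copy per term, expresses the same query and makes the join explicit.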