03 Text Processing
Information Retrieval
Lecture 3
However...
numbers do appear in user queries
single letters
URLs as index terms?
Some alternatives:
emit text appearing inside all or some tags
by the indexer.
programming index!
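A minimal sketch of the "emit text appearing inside all or some tags" alternative, assuming the documents are HTML and using Python's standard `html.parser`. The names `TagTextExtractor` and `text_inside` are made up for illustration; they do not come from the lecture.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, keep_tags):
        super().__init__()
        self.keep_tags = set(keep_tags)
        self.depth = 0    # number of kept tags currently open
        self.chunks = []  # text fragments emitted so far

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.keep_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Emit text only while at least one kept tag is open.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def text_inside(html, keep_tags):
    """Return the text fragments found inside any of keep_tags."""
    parser = TagTextExtractor(keep_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Passing every tag of interest in `keep_tags` approximates "all tags"; passing only a few, such as `title` and `p`, restricts what reaches the indexer.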
Removing Stop Words
Start with a list of stop words
Table lookup
Make a table out of a static stoplist
Match each token against the table
Hashes, perfect hashing, tries
Build into the lexical analyzer (see F&BY)
Or take a statistical approach
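The two approaches above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: the seven-word `STOPLIST`, the function names, and the `max_df` threshold of 0.8 are all invented for the example. The table lookup uses a hash-based `frozenset`; the statistical variant treats any term that occurs in a large fraction of documents as a stop word.

```python
from collections import Counter

# Static stoplist stored in a hash table (illustrative, far from complete).
STOPLIST = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def remove_stop_words(tokens, stoplist=STOPLIST):
    """Table lookup: keep only tokens whose lowercase form is not in the stoplist."""
    return [t for t in tokens if t.lower() not in stoplist]


def statistical_stoplist(docs, max_df=0.8):
    """Statistical approach: terms appearing in at least max_df of the documents.

    docs is a list of token lists; max_df is an assumed cutoff, not a standard value.
    """
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update({t.lower() for t in doc})
    n = len(docs)
    return frozenset(t for t, count in df.items() if count / n >= max_df)
```

A set gives expected constant-time lookup per token. Perfect hashing or a trie would be alternatives when the stoplist is fixed and lookup cost or memory matters more.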