Information Retrieval Techniques

The document provides an overview of Information Retrieval (IR), detailing its history, techniques, and the evolution of search systems, particularly with the advent of the web. It discusses the processes of indexing, retrieval, and ranking, as well as the challenges faced in modern IR systems, including user interaction and query formulation. Additionally, it highlights the impact of web search on IR practices and the importance of user interfaces in facilitating effective information seeking.

Uploaded by

yghg
Information Retrieval Techniques
Information Retrieval (IR)
• IR deals with the representation, storage, organization
of and access to information items
• Types of information items: documents, web pages, online
catalogs, structured records, multimedia objects
• Early goals of IR: indexing text and searching for useful
documents in a collection
• Recent research in IR includes:
• Modelling, Web search, text classification, systems architecture, user
interfaces, data visualization, filtering and languages
Early developments
• For more than 5,000 years, man has organized information for
later retrieval and searching.
• This has been done by compiling, storing, organizing, and indexing
papyrus, hieroglyphics and books
• For holding the various items, special purpose buildings called
libraries or bibliothekes are used
• Nowadays, libraries are everywhere
• In 2008, more than 2 billion items were checked out from libraries in the
US – an increase of 10% over the previous year
• Since the volume of information in libraries is always growing, it is
necessary to build specialized data structures for fast search: indexes
• For centuries, indexes have been created manually as sets of categories,
with labels associated with each category
• The advent of modern computers has allowed the construction of large
indexes automatically
Early developments in IR
• Hans Peter Luhn, Eugene Garfield, Philip Bagley, and Calvin Mooers
were early pioneers; Mooers is credited with coining the term information retrieval
• Cyril Cleverdon published the Cranfield studies on retrieval
evaluation
• Joseph Becker and Robert Hayes published the first book on IR
• Karen Spärck Jones and Gerard Salton: TF-IDF term weighting
scheme (term frequency-inverse document frequency, a statistical
measure used to evaluate how important a word is to a document in
a collection or corpus)
• Jardine and van Rijsbergen articulated the cluster hypothesis
• The first ACM SIGIR international conference on IR was held in 1978
• Van Rijsbergen published a classic book entitled Information Retrieval,
which focuses on the probabilistic model
• Salton and McGill published a classic book, “Introduction to Modern
Information Retrieval”, which focuses on the vector model
Libraries and Digital Libraries
• Libraries were among the first institutions to adopt IR
systems for retrieving information
• Initially, such systems consisted of an automation of
existing processes such as card catalog searching
• Increased search functionality was then added – subject
headings, keywords, query operators
• Nowadays, the focus has been on improved graphical
interfaces, electronic forms, hypertext features.
IR at the Center of the Stage
• Until recently, IR was an area of interest restricted mainly to
librarians and information experts
• A single fact changed these perceptions: the introduction of
the Web, which has become the largest repository of
knowledge in human history
• Due to its enormous size, finding useful information on the
Web usually requires running a search
• Searching on the Web is all about IR and its
technologies
• IR has gained a place with other technologies at the center
of the stage
IR problem
• Users of modern IR systems, such as search engine users, have
information needs of varying complexity
• Ex : Find all documents that address the role of the Federal Government in
financing the operation of the National Railroad Transportation Corporation
(AMTRAK)
• The full description of the user information need is not necessarily a good
query to be submitted to the IR system
• Instead, the user might want to first translate this information need into a
query
• This translation process yields a set of keywords or index terms, which
summarize the user information need.
• Given the user query, the key goal of the IR system is to retrieve information
that is useful or relevant to the user
• The IR system must rank the information items according to the degree of
relevancy to the user query
• IR problem: the key goal of an IR system is to retrieve all the items that are
relevant to a user query, while retrieving as few non-relevant items as
possible
• Relevance is of central importance in IR
The User’s Task
• A user who seeks information on a topic of their
interest
• This user first translates their information need into a query,
which requires specifying the words that compose the query
• In this case, the user is searching or querying for information
of their interest.
• A user who has an interest that is either poorly defined
or inherently broad
• For instance, the user has an interest in car racing and wants
to browse documents on Formula 1 and Formula Indy
• In this case, the user is browsing or navigating the documents
of the collection
The User’s Task
Information Vs Data Retrieval
IR System
• Assemble the document collection, which can be private or crawled from
the Web, and store it in a central repository
• Documents in the central repository need to be indexed for
fast retrieval (index structure: inverted index)
• The retrieval (searching) process can then be initiated
• The user gives a query that reflects their information need
• The user query is converted into a system query by parsing and
expanding
• The system query is processed against the index to retrieve a subset of all
documents
• Retrieved documents are ranked and the top-ranked documents
are returned to the user
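The indexing step above can be illustrated with a minimal sketch of an inverted index. The function name build_inverted_index, the whitespace tokenizer, and the three toy documents are assumptions made for illustration; real systems use far more elaborate text processing and index layouts.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercasing plus whitespace splitting is enough here.
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical toy collection standing in for the central repository.
docs = {
    1: "web search engines rank pages",
    2: "libraries index books",
    3: "search engines index web pages",
}
index = build_inverted_index(docs)
print(index["index"])   # [2, 3]
print(index["search"])  # [1, 3]
```

Looking up a query token is then a single dictionary access, which is what makes retrieval against the index fast.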
IR System

• To improve the IR system, evaluation is done
• Evaluation procedure: comparing the set of results produced by the IR
system with the results suggested by human specialists
• To improve ranking, collect feedback from the user and use it to
change the results
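One common way to compare system results with human judgments is to compute precision and recall. The slides do not name these measures, so the sketch below is an assumption about how the comparison might be quantified; the function name and the example ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compare system results with the set judged relevant by human specialists."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved items that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant items that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned documents 1-4; specialists judged 2, 4 and 5 relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p)            # 0.5
print(round(r, 3))  # 0.667
```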
The Retrieval and Ranking Processes
• The process of indexing, retrieval and ranking
• User Interface manages interaction with the user:
• Query input and document output.
• Relevance feedback.
• Visualization of results.
• Indexing Process - constructs an inverted index of word
to document pointers
• Retrieval Process - retrieves documents that contain a
given query token from the inverted index
• Ranking Process - assigns a score to all retrieved
documents according to a relevance metric
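As an illustration of the ranking process, the sketch below scores documents with a basic TF-IDF sum, the weighting scheme mentioned earlier in the slides. The exact formula (raw term frequency times the natural log of N/df), the function name tf_idf_rank, and the toy documents are assumptions; many variants of the weighting exist.

```python
import math
from collections import Counter


def tf_idf_rank(query, docs):
    """Rank documents by the sum of tf * idf over the query tokens."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in query.lower().split() if t in tf)
        if score > 0:
            scores[doc_id] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: "web search engines rank web pages",
    2: "libraries index books",
    3: "search engines index pages",
}
ranking = tf_idf_rank("web search", docs)
print([doc_id for doc_id, _ in ranking])  # [1, 3]
```

Document 1 ranks first because it contains the rarer term "web" twice, while document 3 matches only "search".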
The Retrieval and Ranking Processes
Indexing Process and Retrieval Process
• Text transformation (text operations) forms the index words
(tokens)
• Stop-word removal (stop-words are words that are filtered out
before or after processing of natural language data)
• Stemming (removing the suffix from a word to reduce it to its
root word; e.g., in “Flying”, “ing” is removed, leaving the root word “Fly”)
Indexing Process and Retrieval Process
• Stop-word removal
• Reduces the set of representative keywords for a large
collection by dropping words such as "a", "and", "but", "how", "or"
• For example, "What is a motherboard?" reduces to "motherboard"
• Stop-list: contains the stop-words, which are not to be used as index terms
• Prepositions, articles, pronouns, some adverbs and adjectives,
some frequent words (e.g. document)
• The removal of stop-words usually improves IR effectiveness
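A minimal sketch of stop-word removal, using the motherboard example above. The stop-list here is a tiny hypothetical one chosen for illustration; production stop-lists are much longer and language-specific.

```python
# Hypothetical, deliberately small stop-list.
STOP_WORDS = {"a", "an", "and", "but", "how", "is", "or", "the", "what"}


def remove_stop_words(text):
    """Lowercase the text, strip simple punctuation, and drop stop-words."""
    tokens = [t.strip('?.,!"').lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


print(remove_stop_words("What is a motherboard?"))  # ['motherboard']
```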
Indexing Process and Retrieval Process
• Stemming
• Different word forms may bear similar meaning (e.g. search, searching): create a
“standard” representation for them
• Stemming reduces distinct words to their common grammatical root by removing some
word endings
• Stemming example: computer, compute, computes, computing, computed, computation → comput
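The stemming example above can be reproduced with a toy suffix stripper. This is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length of three characters are assumptions picked so that the slide's examples work.

```python
# Assumed suffix list, checked in this order (longer endings first).
SUFFIXES = ("ation", "ing", "ed", "er", "es", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({naive_stem(w) for w in words})  # {'comput'}
print(naive_stem("Flying"))            # fly
```

A rule list this crude will mis-stem many English words, which is why real systems rely on carefully designed stemmers.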
Changes in Web
• Web search is the most prominent application of IR and its
techniques – the ranking and indexing components of any
search engine are fundamentally IR pieces of technology
• First impact – features of the document collection itself
• The web is composed of pages distributed over millions of sites and
connected through hyperlinks
• This requires collecting all documents and storing copies of them in
a central repository, before indexing
• This is a new phase in the IR process, introduced by the Web: crawling
• Second impact
• The size of the collection
• The volume of user queries submitted daily
• As a consequence, performance and scalability have become critical
characteristics of the IR system
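The crawling phase described above (follow hyperlinks, store a copy of each page in a central repository) can be sketched as a breadth-first traversal. To stay self-contained, the sketch walks a hypothetical in-memory "web" instead of making network requests; the fetch callback, the URLs, and the page contents are all assumptions.

```python
from collections import deque


def crawl(seed, fetch):
    """Breadth-first crawl; returns {url: text} for every page reachable from seed."""
    repository = {}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in repository:
            continue  # already stored
        page = fetch(url)
        if page is None:
            continue  # page could not be fetched
        text, links = page
        repository[url] = text
        frontier.extend(link for link in links if link not in repository)
    return repository


# Hypothetical toy web: url -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("home page", ["a.example/docs", "b.example/"]),
    "a.example/docs": ("documentation", ["a.example/"]),
    "b.example/": ("another site", ["missing.example/"]),
}
repository = crawl("a.example/", TOY_WEB.get)
print(sorted(repository))  # ['a.example/', 'a.example/docs', 'b.example/']
```

A real crawler would also need politeness delays, robots.txt handling, and URL normalization, none of which are shown here.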
Changes in Web
• Third major impact: in a very large collection, predicting
relevance is much harder than before
• Fortunately, the Web also includes new sources of evidence
• Ex: hyperlinks and user clicks in documents in the answer set
• The fourth major impact derives from the fact that the Web is
also a medium to do business
• The search problem has been extended beyond the seeking of text
information to also encompass other user needs
• Ex: the price of a book, the phone number of a hotel, the link for
downloading a piece of software
• Fifth major impact of the Web on search is Web spam
• Web spam: abusive availability of commercial information disguised
in the form of informational content
• This difficulty is so large that today we talk of Adversarial Web
Retrieval
Practical Issues on the Web
• Security
• Commercial transactions over the Internet are not yet a completely
safe procedure
• Privacy
• Frequently, people are willing to exchange information as long as it
does not become public
• Copyright and patent rights
• It is far from clear how the wide spread of data on the Web affects
copyright and patent laws in the various countries
• Scanning, optical character recognition (OCR), and cross-
language retrieval
Search
• User interfaces for search focus on
• the human users of search systems
• the search user interface, i.e., the window through which search
systems are seen
• The role of the user interface is to aid searchers in understanding and
expressing their information need
• Further, the interface should help users
• formulate their queries
• select among available information sources
• understand search results
• keep track of the progress of their search
• User interaction with search interfaces differs depending on
• the type of task
• the domain expertise of the information seeker
• the amount of time and effort available to invest in the process
Search
Information lookup and exploratory search
• Information lookup tasks
• are akin to fact retrieval or question answering
• can be satisfied by discrete pieces of information:
numbers, dates, names, or Web sites
• can work well for standard Web search interactions
Search
• Exploratory search is divided into learning and investigating
tasks
• Learning search
• requires more than single query-response pairs
• requires the searcher to spend time
• scanning and reading multiple information items
• synthesizing content to form new understanding
• Investigating refers to a longer-term process which
• involves multiple iterations that take place over perhaps very long
periods of time
• may return results that are critically assessed before being
integrated into personal and professional knowledge bases
• may be concerned with finding a large proportion of the relevant
Search
• Information seeking can be seen as being part of a larger
process referred to as sensemaking
• Sensemaking is an iterative process of formulating a
conceptual representation from a large collection
• Russell et al. observe that most of the effort in sensemaking
goes towards the synthesis of a good representation
• Some sensemaking activities interweave search throughout,
while others consist of doing a batch of search followed by a
batch of analysis and synthesis
• Examples of deep analysis tasks that require sensemaking
• the legal discovery process
• epidemiology (disease tracking)
• studying customer complaints to improve service
• obtaining business intelligence
Search
• Classic vs. Dynamic Model
• Classic notion of the information seeking process
1. problem identification
2. articulation of information need(s)
3. query formulation
4. results evaluation
Search
• Recent models emphasize the dynamic nature of the search
process
• The users learn as they search
• Their information needs adjust as they see retrieval results
and other document surrogates
• This dynamic process is sometimes referred to as the berry-picking
model of search
• The rapid response times of today’s Web search engines
allow searchers:
• to look at the results that come back
• to reformulate their query based on these results
Search
• Jansen et al. analyzed search logs and found that the
proportion of users who modified queries is 52%
• Some seeking models cast the process in terms of strategies and
how choices for next steps are made
• In some cases, these models are meant to reflect conscious
planning behavior by expert searchers
• In others, the models are meant to capture the less planned,
potentially more reactive behavior of a typical information seeker
Search
Navigation vs. Search
• Navigation: the searcher looks at an information structure and browses
among the available information
• This browsing strategy is preferable when the information structure is well
matched to the user’s information need
• It is mentally less taxing to recognize a piece of information than it is to recall
it
• It works well only so long as appropriate links are available
• If the links are not available, then the browsing experience might be
frustrating
• Spool discusses an example of a user looking for a software driver for a
particular laser printer
• Say the user first clicks on printers, then laser printers, then the following
sequence of links:
• HP laser printers
• HP laser printers model 9750
• software for HP laser printers model 9750
Search
• Search Process
• One common observation is that users often reformulate their
queries with slight modifications
• Another is that searchers often search for information that they
have previously accessed
• The users’ search strategies differ when searching over previously seen
materials
• Researchers have developed search interfaces that support both query
history and revisitation
• Studies also show that it is difficult for people to determine whether
or not a document is relevant to a topic
• The less users know about a topic, the poorer judges they are of
whether a search result is relevant to that topic
• Other studies found that searchers tend to look at only the top-
ranked retrieved results
• Further, they are biased towards thinking the top one or two results
Search
• Search Process
• Studies also show that people are poor at estimating how much of
the relevant material they have found
• Other studies have assessed the effects of knowledge of the
search process itself
• These studies have observed that experts use different strategies
than novice searchers
• For instance, Tabatabai et al. found that
• expert searchers were more patient than novices
• this positive attitude led to better search outcomes
Search Interfaces
• The most common way to begin an information seeking session in
online information systems is to use a Web search engine
• Another method is to select a Web site from a personal
collection of already-visited sites
• These are typically stored in the browser’s bookmarks
• Online bookmark systems are popular among a smaller
segment of users
• Web directories are also used as a common starting point,
but have been largely replaced by search engines
Search Interfaces
Query Specification
• The primary methods for a searcher to express their information
need are either
• entering words into a search entry form
• selecting links from a directory or other information organization display
• For Web search engines, the query is specified in textual form
• Typically, Web queries today are very short, consisting of one to
three words
• Short queries reflect the standard usage scenario in which the
user tests the waters
• If the results do not look relevant, then the user reformulates their query
• If the results are promising, then the user navigates to the most
relevant-looking Web site
• This search behavior is a demonstration of the orienteering strategy
Search Interfaces
• Before the Web, search systems regularly supported Boolean
operators and command-based syntax
• However, these are often difficult for most users to understand
• Jansen et al. conducted a study over a Web log with 1.5M
queries, and found that
• 2.1% of the queries contained Boolean operators
• 7.6% contained other query syntax, primarily double-quotation marks
for phrases
• White et al. examined interaction logs of nearly 600,000 users,
and found that
• 1.1% of the queries contained one or more operators
• 8.7% of the users used an operator at any time
Search Interfaces
• Web ranking has gone through three major phases
• In the first phase, from approximately 1994–2000
• Since the Web was much smaller then, complex queries were less likely
to yield relevant information
• Further, the pages retrieved did not necessarily contain all query words
• Around 1997, Google moved to conjunctive queries only
• The other Web search engines followed, and conjunctive ranking
became the norm
• Google also added term proximity information and page importance
scoring (PageRank)
• As the Web grew, longer queries posed as phrases started to produce
highly relevant results
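Conjunctive queries, as mentioned above, return only the pages that contain every query word. With an inverted index this amounts to intersecting posting lists. The sketch below assumes a hand-written toy index and a hypothetical function name; it does not reflect how any particular search engine implements it.

```python
def conjunctive_search(query, index):
    """Return ids of the documents that contain every query token (AND semantics)."""
    postings = [set(index.get(token, ())) for token in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical inverted index: token -> ids of documents containing it.
index = {
    "web": [1, 3],
    "search": [1, 3, 4],
    "engines": [1, 3],
    "libraries": [2],
}
print(conjunctive_search("web search", index))     # [1, 3]
print(conjunctive_search("web libraries", index))  # []
```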
Query Specification Interfaces
• The standard interface for a textual query is a search box
entry form
• Studies suggest a relationship between query length and
the width of the entry form
• Results found that either small forms discourage long queries or
wide forms encourage longer queries
• Some entry forms are followed by a form that filters the
query in some way
• For instance, at yelp.com, the user can refine the search by
location using a second form
Query Specification Interfaces

• Notice that the yelp.com form also shows the user’s home location, if it has
been specified previously
• Some search forms show hints on what kind of information should be
entered into each form
• For instance, in zvents.com search, the first box is labeled “what are you
looking for?”
Query Specification Interfaces

• The previous example also illustrates specialized input types that
some search engines are supporting today
• The zvents.com site recognizes that words like “tomorrow” are time-sensitive
• It also allows flexibility in the syntax of dates
• To illustrate, searching for “comedy on wed” automatically
computes the date for the nearest future Wednesday
• This is an example of how the interface can be designed to reflect how
people think
• Some interfaces show a list of query suggestions as the user
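The date handling described above for queries like “comedy on wed” could be sketched as follows. How zvents.com actually parses dates is not stated in the slides, so the function names, the three-letter minimum for weekday abbreviations, and the choice that the current day does not count as "future" are all assumptions.

```python
from datetime import date, timedelta

# Index in this list matches date.weekday() (Monday is 0).
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_date_word(word, today):
    """Map a time-sensitive word to a concrete date, or return None."""
    word = word.lower()
    if word == "today":
        return today
    if word == "tomorrow":
        return today + timedelta(days=1)
    matches = [i for i, name in enumerate(WEEKDAYS)
               if len(word) >= 3 and name.startswith(word)]
    if len(matches) != 1:
        return None
    # Days until the next such weekday, strictly in the future (1 to 7).
    days_ahead = (matches[0] - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


def dates_in_query(query, today):
    """Collect every query word that resolves to a date."""
    found = {}
    for word in query.split():
        resolved = resolve_date_word(word, today)
        if resolved is not None:
            found[word] = resolved
    return found


monday = date(2024, 1, 1)  # 1 January 2024 was a Monday
print(dates_in_query("comedy on wed", monday))
```

Passing today explicitly, rather than calling date.today() inside the function, keeps the behavior reproducible.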
Query Specification Interfaces
• Anick et al. found that users clicked on dynamic Yahoo
suggestions one third of the time
• Often the suggestions shown are those whose prefix
matches the characters typed so far
• However, in some cases, suggestions are shown that only
have interior letters matching
• Further, suggestions may be shown that are synonyms
of the words typed so far
• Dynamic query suggestions, from Netflix.com
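A minimal sketch of the two matching styles described above: prefix matches first, then candidates that match only on interior letters. The candidate titles, the ordering rule, and the limit parameter are assumptions; synonym-based suggestions are not covered.

```python
def suggest(typed, candidates, limit=5):
    """Return prefix matches first, then interior (substring) matches."""
    typed = typed.lower()
    if not typed:
        return []
    prefix = [c for c in candidates if c.lower().startswith(typed)]
    interior = [c for c in candidates
                if typed in c.lower() and not c.lower().startswith(typed)]
    return (prefix + interior)[:limit]


# Hypothetical catalog of titles.
titles = ["Star Wars", "Stargate", "Dancing with the Stars", "Mustard", "Casablanca"]
print(suggest("star", titles))
# ['Star Wars', 'Stargate', 'Dancing with the Stars', 'Mustard']
```

"Mustard" shows why interior matching can surprise users: it contains the letters "star" but has nothing to do with the intended word.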
Query Specification Interfaces

• The dynamic query suggestions can be derived from several
sources, including:
• The user’s own query history
• A set of metadata that a Web site’s designer considers important
• All of the text contained within a Web site
• Dynamic query suggestions, grouped by type, from
Query Specification Interfaces
Retrieval Results Display
• When displaying search results, either
• the documents must be shown in full, or else
• the searcher must be presented with some kind of representation of
the content of those documents
• The document surrogate refers to the information that
summarizes the document
• This information is a key part of the success of the search interface
• The design of document surrogates is an active area of research and
experimentation
• The quality of the surrogate can greatly affect the perceived relevance
of the search results listing
Retrieval Results Display
• In Web search, the page title is usually shown prominently, along with the
URL and other metadata
• In search over information collections, metadata such as date published
and author are often displayed
• Text summary (or snippet) containing text extracted from the document is
also critical
• Currently, the standard results display is a vertical list of textual
summaries
• This list is sometimes referred to as the SERP (Search Engine Results
Page)
• In some cases the summaries are excerpts drawn from the full text that
contain the query terms
• In other cases, specialized kinds of metadata are shown in addition to
standard textual results
Retrieval Results Display
• For example, a query on a term like “rainbow” may return
sample images as one entry in the results listing
Retrieval Results Display
• A query on the name of a sports team might retrieve
the latest game scores and a link to the match schedule
Retrieval Results Display
• Nielsen notes that in some cases the information need is
satisfied directly in the search results listing
• This makes the search engine an “answer engine”
• Displaying the query terms in the context in which they
appear in the document
• Improves the user’s ability to gauge the relevance of the results
• It is sometimes referred to as KWIC - keywords in context
• It is also known as query-biased summaries, query-oriented
summaries, or user-directed summaries
• The visual effect of query term highlighting can also improve
usability of search results listings
• Highlighting can be shown both in document surrogates in the
retrieval results and in the retrieved documents
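The keywords-in-context idea and query term highlighting can be sketched together: cut a window of text around the first query-term hit and mark every hit inside it. The function name, the fixed character window, and the square-bracket highlighting are assumptions; real snippet generators weigh sentence boundaries and multiple hits.

```python
import re


def kwic_snippet(text, query, window=30):
    """Return an excerpt around the first query-term hit with hits in [brackets]."""
    terms = [re.escape(t) for t in query.split()]
    if not terms:
        return None
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    # Highlight every query term that falls inside the excerpt.
    highlighted = pattern.sub(lambda m: "[" + m.group(0) + "]", text[start:end])
    return ("..." if start > 0 else "") + highlighted + ("..." if end < len(text) else "")


text = "The inverted index maps each word to the documents that contain it."
snippet = kwic_snippet(text, "index", window=10)
print(snippet)
```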
Retrieval Results Display
• Determining which text to place in the summary, and how
much text to show, is a challenging problem
• Often the summaries contain all the query terms in close
proximity to one another
• However, there is a trade-off between
• Showing contiguous sentences, to aid in coherence in the result
• Showing sentences that contain the query terms
• Some results suggest that it is better to show full sentences
rather than cut them off
• On the other hand, very long sentences are usually not desirable
in the results listing
Retrieval Results Display

• Further, the kind of information to display should vary according to
the intent of the query
• Longer results are deemed better than shorter ones for certain types of
information need
• On the other hand, an abbreviated listing is preferable for navigational queries
• Similarly, requests for factual information can be satisfied with a concise
results display
Retrieval Results Display
• The page results below show figures extracted from
journal articles alongside the search results
Query Reformulation
• There are tools to help users reformulate their query
• One technique consists of showing terms related to the query or
to the documents retrieved in response to the query
• A special case of this is spelling corrections or suggestions
• Usually only one suggested alternative is shown: clicking on that
alternative re-executes the query
• In years past, the search results were shown using the
purportedly incorrect spelling
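A spelling suggestion of the "one suggested alternative" kind could be sketched with Python's standard difflib module. This is only an illustration under assumptions: the vocabulary is a hypothetical word list, the 0.8 similarity cutoff is arbitrary, and Web search engines typically rely on query logs rather than a fixed dictionary.

```python
import difflib


def spelling_suggestion(query, vocabulary):
    """Return one corrected query, or None if no word needs correcting."""
    words = query.lower().split()
    corrected = []
    for word in words:
        if word in vocabulary:
            corrected.append(word)
            continue
        # Closest vocabulary word, if any is similar enough.
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
        corrected.append(close[0] if close else word)
    suggestion = " ".join(corrected)
    return suggestion if suggestion != " ".join(words) else None


# Hypothetical vocabulary.
vocabulary = ["information", "retrieval", "search", "engine"]
print(spelling_suggestion("informaton retreival", vocabulary))
```

Returning None when nothing changes lets the interface show a suggestion link only when there is an alternative to offer.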
Query Reformulation
• Term expansion: search interfaces are increasingly employing
related term suggestions
• Log studies suggest that term suggestions are a somewhat
heavily-used feature in Web search
• Jansen et al. conducted a log study and found that 8% of queries were
generated from term suggestions
• Anick et al. found that 6% of users who were exposed to term
suggestions chose to click on them
• Some query term suggestions are based on the entire search
session of the particular user
• Others are based on behavior of other users who have issued the
same or similar queries in the past
• One strategy is to show similar queries by other users
• Another is to extract terms from documents that have been clicked on in
the past by searchers who issued the same query
Query Reformulation
• Relevance feedback is another method whose goal is to aid
in query reformulation
• The main idea is to have the user indicate which documents are
relevant to their query
• In some variations, users also indicate which terms extracted from
those documents are relevant
• The system then computes a new query from this information
and shows a new retrieval set
• Nonetheless, this method has not been found to be successful
from a usability perspective
• Because of that, it does not appear in standard interfaces today
• This stems from several factors
• People are not particularly good at judging document relevance,
especially for topics with which they are unfamiliar ; The beneficial
Organizing Search Results
• Organizing results into meaningful groups can help users
understand the results and decide what to do next
• Popular methods for grouping search results: category
systems and clustering
• Category system: meaningful labels organized in such a way
as to reflect the concepts relevant to a domain
• Good category systems have the characteristics of being coherent
and relatively complete
• Their structure is predictable and consistent across search results
for an information collection
Organizing Search Results
• The most commonly used category structures are flat, hierarchical, and faceted
categories
• Flat categories are simply lists of topics or subjects
• They can be used for grouping, filtering (narrowing), and sorting sets of documents in search interfaces
• Most Web sites organize their information into general categories
• Selecting a category narrows the set of information shown accordingly
• Some experimental Web search engines automatically organize results into flat
categories
• Studies using this kind of design have received positive user responses (Dumais et al., Kules et al.)
• However, it can be difficult to find the right subset of categories to use for the vast
content of the Web
Organizing Search Results
• An alternative representation is faceted metadata
• Unlike flat categories, faceted metadata allow the assignment of multiple categories to
a single item
• Each category corresponds to a different facet (dimension or feature type) of the
collection of items
• Clustering refers to the grouping of items according to some measure of similarity
• It groups together documents that are similar to one another but different from the rest
of the collection
• For example, all the documents written in Japanese that appear in a collection of primarily English articles
• The greatest advantage of clustering is that it is fully automatable
• The disadvantages of clustering include
• an unpredictability in the form and quality of results
• the difficulty of labeling the groups
• the counter-intuitiveness of cluster sub-hierarchies
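The faceted metadata idea described above, where one item carries several categories along different facets, can be sketched as a simple filter. The recipe collection, the facet names, and the function name are hypothetical examples, not taken from the slides.

```python
def filter_by_facets(items, **selected):
    """Keep the items whose metadata includes every selected facet value."""
    return [item for item in items
            if all(value in item["facets"].get(facet, ())
                   for facet, value in selected.items())]


# Hypothetical collection; each item has values along two facets.
recipes = [
    {"title": "Lemon tart",
     "facets": {"course": ["dessert"], "ingredient": ["lemon", "butter"]}},
    {"title": "Lemon chicken",
     "facets": {"course": ["main"], "ingredient": ["lemon", "chicken"]}},
    {"title": "Chocolate cake",
     "facets": {"course": ["dessert"], "ingredient": ["chocolate", "butter"]}},
]
desserts_with_butter = filter_by_facets(recipes, course="dessert", ingredient="butter")
print([r["title"] for r in desserts_with_butter])  # ['Lemon tart', 'Chocolate cake']
```

Each additional facet value narrows the result set, which is how faceted navigation lets users refine along several dimensions at once.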
Visualization in Search Interfaces

• Visualizing Boolean syntax

• Visualizing query terms within retrieval results

• Visualizing relationships among words and documents

• Visualization for text mining


Visualization in Search Interfaces
• Visualizing Boolean syntax
• Boolean query syntax is difficult for most users and is rarely used in Web
search
• For many years, researchers have experimented with how to visualize
Boolean query specification
• A common approach is to show Venn diagrams
• A more flexible version of this idea was seen in the VQuery system,
proposed by Steve Jones
• The VQuery interface for Boolean query specification
Visualization in Search Interfaces
• Visualizing query terms within retrieval results
• Understanding the role of the query terms within the retrieved docs can help relevance
assessment
• Experimental visualizations have been designed that make this role more explicit

• TileBars interface
• Field-sortable search results view
• Colored TileBars view
• Another variation on the idea of showing query term hits within documents is to show thumbnails
• Textually enhanced thumbnails
Visualization in Search Interfaces
• Visualizing relationships among words and documents
Visual wordNet
