Effective Web Searching
T.B. Rajashekar National Centre for Science Information Indian Institute of Science Bangalore - 560 012 (E-Mail: raja@ncsi.iisc.ernet.in)
Effective Web Searching
How do we use libraries and IR systems?
Organization of the web
Accessing web-based information: key problems
Tools for information retrieval on the web
  Directories/ guides
  Search engines
  Meta search tools
  People finding tools
Strategies for web searching
Guides to search tools
Keeping current
T.B. Rajashekar November 2000 2
How Do We Use Libraries and IR Systems?
Libraries:
  How the documents are organised: document types, classification system used
  Access tools: catalogues, indexes, automated catalogues, access points
  Our information need (search topic): translated in terms of the organization scheme employed by the library
Information retrieval systems (e.g. bibliographic databases):
  How the database is organised: record content, fields, search elements
  Indexing and query language: thesaurus, Boolean logic, truncation, etc.
  Our information need: formulated as a search expression using the query language
Organization of the Web
Adopt the same strategy while searching the Web:
  Understand web information architecture
  Understand the information access tools and the information access mechanisms they provide
  Represent our query in terms of mechanisms supported by these tools and search the web
Web sites:
  How the content is organised (document types, structuring and navigation)
  Searchable/indexable and non-searchable/non-indexable content
Structure of web pages:
  Meta tags, page attributes (properties)
Organization of the Web...
The Web is the totality of web pages stored on web servers
Spectacular growth in web-based information sources and services:
  Education and research
  Entertainment
  Business and commerce
  Personal home pages
Estimated to contain over 1 billion indexable web pages
  Doubling each year
  Over 80 million web sites
Accessing Web-based Information: Key Problems
Identification of sources (documents):
  No central card catalog
  Most web pages are not indexed using a standard vocabulary, unlike library catalogues or journal article indexes
  Impossible to reach all related pages/ sites directly
  Need to use intermediate, resource-finding tools
Information Retrieval on the Web
How to find relevant documents on the Web?
Informal:
  Browsing (and bookmarking for later use)
  Friends
  Print sources
  Discussion forums (mailing lists)
  Current awareness services (e.g. Scout Report)
  Guessing web site addresses!
Formal (using information finding tools):
  Web directories/ guides
  Web search engines
  Meta-search tools
  Specialty search engines
Web Directories/ Guides
Also called virtual libraries and Internet resource catalogues
Organised collection of descriptions and links to Internet sources
Organisation: by subject categories (hierarchical); by resource type (patents, e-journals, institutes, etc.)
Most use human experts for source selection, indexing and classification
Some include reviews/ ratings of listed sites
Web Directories/ Guides...
Examples of general web directories:
  Librarians' Index to the Internet (www.lii.org)
  Britannica's Web's best sites (www.britannica.com)
  Infomine (infomine.ucr.edu)
  Scout Report Signpost (www.signpost.org)
  BUBL link (bubl.ac.uk/link)
  Yahoo (www.yahoo.com)
  Magellan (www.mckinley.com)
  Galaxy (www.galaxy.com)
  Looksmart (www.looksmart.com)
  Snap (www.snap.com)
New directory (October 2001):
  JoeAnt (www.joeant.com)
Web Directories/ Guides...
Guides to directories:
  WWW Virtual Library (www.vlib.org)
  Argus Clearinghouse (www.clearinghouse.net)
  Gogettem (www.gogettem.com/)
Subject-specific guides (subject gateways):
  Edinburgh Engineering Virtual Library (www.eevl.ac.uk)
  Social Science Information Gateway (sosig.ac.uk)
  The Internet Pilot To Physics (physicsweb.org/TIPTOP)
  Chemcenter (www.acs.com)
  Programmer's Heaven (www.programmersheaven.com)
Resource type guides:
  Patents (www.european-patent-office.org)
  Electronic journals (www.publist.com)
Web Directories/ Guides...
Most web directories support searching within categories and descriptions, in addition to browsing
Advantages:
  Access to high quality sources
  Do not contain redundant links
  Faster access to sources
Disadvantages:
  One needs to be aware of such directories/ guides
  May not be up-to-date
  May not be exhaustive
  Categories (subject hierarchy) vary across directories
Web Directories/ Guides...
When to use web directories/ guides?
  For broad/ general topics where keyword searching on search engines retrieves too many irrelevant sites
  When you want a few highly relevant sites and do not intend an exhaustive/ comprehensive search
When not to use web directories/ guides?
  For concept/ keyword searches
  When search terms are distinctive
Effective directory/ guide usage:
  Take advantage of the sub-search within categories, supported by most directories/ guides
  Join their mailing lists for automatic updates on new sites
Web Directories/ Guides...
Demonstration of directories/ guides:
  Librarians' Index to the Internet (www.lii.org)
  Britannica's Web's best sites (www.britannica.com)
  Scout Report Signpost (www.signpost.org)
  BUBL link (bubl.ac.uk/link)
  Yahoo (www.yahoo.com)
  WWW Virtual Library (www.vlib.org)
  Argus Clearinghouse (www.clearinghouse.net)
Web Search Engines
Just as A&I (abstracting and indexing) journals index published literature, web search engines build a full-text index to web pages gathered from web sites and provide a keyword search interface to this index
Spider programs periodically visit web sites and gather the web pages for indexing
Also index web sites submitted by site developers
A brief summary of the indexed web page is also prepared
The index usually contains URLs, titles, headings, and other words from the HTML document
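The indexing step described above can be illustrated with a small inverted index. This is only a sketch under stated assumptions: the pages and example.org URLs are made up, the spider is left out entirely, and real engines store far more than a word-to-URL mapping.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Hypothetical pages standing in for what a spider would have gathered.
PAGES = {
    "http://example.org/a": "Activated sludge process simulation",
    "http://example.org/b": "Polymer industries in India",
    "http://example.org/c": "Modeling the activated sludge process",
}

INDEX = build_index(PAGES)

# A keyword search is then a lookup in the index.
print(sorted(INDEX["sludge"]))
```

A multi-word AND search would intersect the URL sets of the individual words; an OR search would take their union.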
Web Search Engines...
The search engines provide a forms-based search interface for entering the queries
Support simple and advanced search interfaces
Search results are returned in the form of a list of web sites matching the query
Some key features supported:
  Phrase searching (double quotes)
  Boolean searching (AND, OR, NOT)
  Implied Boolean: term inclusion (+), term exclusion (-)
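To make phrase searching and implied Boolean concrete, here is a minimal sketch of how a query using double quotes, + and - might be interpreted. The function names are hypothetical, and the rule that unsigned terms are optional (at least one must match when nothing is required) is an assumption; actual engines differ in these details.

```python
import re


def parse_query(query):
    """Split a query into required (+), excluded (-) and optional items.

    Text in double quotes is kept together as one phrase.
    """
    required, excluded, optional = [], [], []
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]+)"|(\S+))', query):
        item = (phrase or word).lower()
        if sign == "+":
            required.append(item)
        elif sign == "-":
            excluded.append(item)
        else:
            optional.append(item)
    return required, excluded, optional


def normalize(text):
    """Reduce text to lower-case words separated and padded by single spaces."""
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def matches_query(text, query):
    """Return True if the page text satisfies the query."""
    required, excluded, optional = parse_query(query)
    page = normalize(text)

    def has(item):
        # Whole-word (or whole-phrase) containment test.
        return normalize(item) in page

    if any(has(item) for item in excluded):
        return False
    if not all(has(item) for item in required):
        return False
    # Assumed rule: with no required terms, at least one optional term must match.
    return bool(required) or not optional or any(has(item) for item in optional)


print(parse_query('+"light combat aircraft" -resume india'))
```

For example, the query +"light combat aircraft" -resume accepts a page that contains the exact phrase and rejects any page that also contains the word resume.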
Web Search Engines
Key features:
  Proximity searches (NEAR, ADJ, BEFORE, AFTER)
  Use of parentheses to group search terms
  Truncation searches (industr*)
  Field-specific searching (Title, URL, Text)
  Natural language queries (Why is the sky blue?)
  Relevance ranking of search results, based on:
    Number of search terms
    Number of times each search term occurs
    Proximity of search terms
    Location of search terms (title, text)
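The four relevance ranking factors listed above can be combined into a toy score, as sketched below. The weights (10, 1, 5, 3) and the proximity measure are arbitrary assumptions chosen for illustration; no search engine is claimed to use this formula.

```python
import re


def tokens(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(title, body, query):
    """Toy relevance score built from the four factors listed above."""
    terms = list(dict.fromkeys(tokens(query)))  # unique terms, order kept
    title_words = tokens(title)
    words = title_words + tokens(body)
    matched = [t for t in terms if t in words]
    if not matched:
        return 0.0
    # Factor 1: how many of the search terms occur in the page.
    coverage = len(matched) / len(terms)
    # Factor 2: how many times the search terms occur.
    frequency = sum(words.count(t) for t in matched)
    # Factor 3: proximity, 1.0 when the first occurrences are adjacent.
    positions = [words.index(t) for t in matched]
    span = max(positions) - min(positions) + 1
    proximity = len(matched) / span
    # Factor 4: location, with a bonus for terms found in the title.
    in_title = sum(1 for t in matched if t in title_words)
    return 10 * coverage + frequency + 5 * proximity + 3 * in_title


focused = score("Light Combat Aircraft",
                "India's light combat aircraft programme",
                "light combat aircraft")
scattered = score("Aviation news",
                  "A light breeze delayed the aircraft before combat training",
                  "light combat aircraft")
print(focused > scattered)
```

Both hypothetical pages contain all three query words, but the first scores higher because the words are adjacent, repeated, and present in the title.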
Web Search Engines
Key features:
  Sub-searching (searching within retrieved records)
  Case sensitivity
  Limit by language
  Limit by age of documents
  Limit by audio, video and image type
  Translation of search results (title and description)
  Limit by domain, host
Web Search Engines...
Examples:
  Fastsearch (alltheweb.com)
  Altavista (www.altavista.com)
  Google (www.google.com)
  Northernlight (www.northernlight.com)
  HotBot (www.hotbot.com)
  Excite (www.excite.com)
New search engines (October 2001):
  Teoma (http://www.teoma.com/)
  Wisenut (http://www.wisenut.com/)
Web Search Engines...
Specialty search engines:
Country-specific search engines:
  www.khoj.com
  www.123india.com
Subject-specific search engines:
  Chemfinder (www.chemfinder.com)
  Engineering Resources Online (www.er-online.co.uk)
  MathSearch (www.maths.usyd.edu.au:8000/MathSearch.html)
  Netpart: Company site locator (www.websense.com/locator.cfm)
  World Trade Locator (www.intl-tradenet.com)
Resource-specific search engines:
  Patents (www.uspto.gov)
  Journal articles (www.findarticles.com)
Web Search Engines...
Example tutorials
Let's look at a couple of tutorials which present and compare the features of major search engines (use local copies if unable to connect):
  Finding Information on the Internet: A tutorial (www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html)
  How to search the world wide web: A tutorial for beginners and non-experts. David P. Habib and Robert L. Balliot. September, 1999 (204.17.98.73/midlib/tutor.htm)
Web Search Engines...
Advantages of search engines:
  Best suited for complex keyword/ concept searches
  Control over search: search terms can be combined as required
  Searches can be limited to period of time, fields, source type, etc.
  Currency of information, made possible by regular addition by web spiders
  Exhaustive information can be retrieved (with lots of patience!)
Disadvantages:
  Time consuming
  False positives
  Search engines vary in terms of search techniques/ syntax
  Dead links, redundant links (the same document is displayed more than once)
  Spamming (salting of pages)
  Higher ranking of paying sites
Web Search Engines...
Limitations of web search engines:
  Poor retrieval effectiveness (relevance), as little vocabulary control is exercised by web site developers and the index engines
  Different search engines return different search results due to variation in the indexing and search process (40% non-overlap)
  None of the search engines comes close to indexing the entire web, much less the entire Internet
  Content not indexed:
    PDF documents
    Content that requires log in
    Databases searched using CGI programs
    Web content on intranets behind firewalls
Web Search Engines...
Limitations of web search engines:
  Limited support for field-based searching (limitation lies mostly with HTML itself)
  Poor support for search using META tag fields
Web Search Engines...
Demonstration of search engines:
  Fastsearch (www.alltheweb.com)
  Altavista (www.altavista.com)
  Google (www.google.com)
  Northernlight (www.northernlight.com)
Meta Search Tools
Exhaustive searches require use of more than one web search engine and familiarity with their search interfaces
Meta search tools provide a common interface, conduct searches in many search engines simultaneously, and return results in a uniform format
Do not gather web pages, build indexes, accept URL additions, classify or review web sites
Some features supported:
  Duplicate hits removal
  Rank results
  Selection of search engine(s) to be used
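Duplicate removal and re-ranking in a meta search tool can be sketched as follows. The engine names, the example.org URLs and the reciprocal-rank scoring are all assumptions made for illustration, and the result lists are hard-coded rather than fetched from any real engine.

```python
def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one ranked list.

    Each URL earns 1/rank from every engine that returned it, so pages
    ranked highly by several engines rise to the top. Duplicates collapse
    into a single entry (a trailing slash is ignored when comparing URLs).
    """
    scores = {}
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls, start=1):
            key = url.rstrip("/")
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical top-3 results from two hypothetical engines.
RESULTS = {
    "engine_a": ["http://example.org/lca", "http://example.org/hal",
                 "http://example.org/news"],
    "engine_b": ["http://example.org/hal/", "http://example.org/lca",
                 "http://example.org/ada"],
}

MERGED = merge_results(RESULTS)
for url in MERGED:
    print(url)
```

Because only the top results of each engine are merged, a sketch like this also shows why meta search is not exhaustive: anything an engine ranks below its cut-off never reaches the merged list.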
Meta Search Tools...
Search using multiple search engines
Search using a meta search tool
Meta Search Tools...
Meta search tools (remote sites):
  MetaCrawler (www.metacrawler.com)
  Ixquick (www.ixquick.com)
  Dogpile (www.dogpile.com)
  ProFusion (www.profusion.com)
Meta search tools (local, installable software):
  Copernic (www.copernic.com)
  SearchPad (www.searchpad.com)
  LexiBot (www.completeplanet.com)
Meta Search Tools...
Advantages:
  Query can be run across multiple search engines
  User needs to learn only the search interface of the meta search tool
  Better results: retrieves top-ranking pages from individual search engines
Disadvantages:
  Unique features of individual search engines are lost
  Not exhaustive: use only top results returned by search engines
Meta Search Tools...
When to use meta search tools?
  Need to be used cautiously
  Good for simple searches, particularly if search terms are distinctive or unique
  Good for testing with a few keywords and finding which individual search engine returns good results
  Good for quick and dirty searching if you are in a hurry and want to find a few relevant sites quickly
  For complex searches, involving many search terms, Boolean logic, etc., it is better to use individual search engines
Meta Search Tools...
Demonstration:
  MetaCrawler (www.metacrawler.com)
  Ixquick (www.ixquick.com)
  Dogpile (www.dogpile.com)
  ProFusion (www.profusion.com)
People Finding Tools
Register names and addresses and find e-mail addresses
Examples:
  Bigfoot (www.bigfoot.com)
  Peoplesearch (www.peoplesearch.net)
  Ahoy (ahoy.cs.washington.edu:6060/)
  Four11 (www.four11.com)
  Switchboard (www.switchboard.com)
  Whowhere (www.whowhere.lycos.com/)
Most search engines also support people searches (e.g. Altavista, Google, Yahoo!)
People Finding Tools
Using people finding tools:
  The person should have registered in the tool(s)
  The searcher should know both surname and first name, else too many names will be retrieved
  Bias towards U.S.-based people
  Often, the required e-mail address cannot be retrieved through these tools
  Alternatively, any search engine may be used (phrase search using the person's name)
  If the person's affiliation is known, Yahoo! Directory may be used to locate the institution and e-mail address
Web Search Strategies
Search steps:
1. Analyze the search topic and identify the search terms (both inclusion and exclusion), their synonyms (if any), phrases and Boolean relations (if any)
2. Select the search tool(s) to be used (meta search engine, directory, general search engine, specialty search engine)
3. Translate the search terms into search statements of the selected search engine
4. Perform the search
5. Refine the search based on results
6. Visit the actual site(s) and save the information (using the File > Save option of the browser)
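Steps 1 to 3 amount to turning an analysed topic into a search statement. The sketch below shows one way to do that; the function name and the generic AND / OR / AND NOT syntax are assumptions, since each search engine has its own syntax that should be checked on its help pages.

```python
def build_query(all_of=(), any_of=(), none_of=()):
    """Build a Boolean search statement from analysed search terms.

    all_of  -- terms or phrases that must all occur (joined with AND)
    any_of  -- synonyms or variants, any one of which may occur (OR group)
    none_of -- terms to exclude (AND NOT)
    """
    def fmt(term):
        # Multi-word terms are quoted so they are searched as a phrase.
        return f'"{term}"' if " " in term else term

    parts = [fmt(t) for t in all_of]
    if any_of:
        parts.append("(" + " OR ".join(fmt(t) for t in any_of) + ")")
    if not parts:
        raise ValueError("at least one inclusion term is required")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" AND NOT {fmt(term)}"
    return query


print(build_query(all_of=["Light Combat Aircraft", "India"], none_of=["resume"]))
print(build_query(all_of=["activated sludge process"], any_of=["simulat*", "model*"]))
```

Keeping the analysed terms separate from the final statement makes step 5 easier: refining the search means editing the term lists and rebuilding the statement for whichever tool is being tried.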
Web Search Strategies
Tips for effective web searching:
  Broad or general concept searches: start with directory-based services (want a few highly relevant sites for a broad topic)
  Highly specific topics, or topics with unique terms/ many concepts: use the search tools
  Go through the help pages of search tools carefully
  Gather sufficient information about the search topic before searching:
    Spelling variations, synonyms, broader and narrower terms
  Use specific keywords; rare/unusual words are better than common ones
Web Search Strategies...
Tips for effective web searching:
  Prefer phrase and adjacency searching to Boolean ("stuffed animal" rather than stuffed AND animal)
  Use as many synonyms as possible: search engines use statistical retrieval methods and produce better results with more query words
  Avoid use of very common words (e.g., computer)
  Enter search terms in lower case. Use upper case to force exact match (e.g. Light Combat Aircraft, LCA)
  Use the "More like this" option, if supported by the search engine (e.g. Excite, Google)
Web Search Strategies...
Tips for effective web searching:
  Repeat the search by varying search terms and their combinations; try this on different search tools
  Enter most important terms first: some search tools are sensitive to word order
  Use the NOT operator to exclude unwanted pages (e.g. biodata, resumes, courses)
  Go through at least 5 pages of search results before giving up the scan
  Select 2 or 3 search tools and master their search techniques
Sample Web Searches
Companies dealing with polymers
  Do not use search engines (too many irrelevant hits)
  Use directory sources (e.g. www.yahoo.com)
    Follow the categories: Business and Economy > Business-to-Business > Chemicals
    Do a sub-search on Polymers
  Use specialty search engines (e.g. www.bizweb.com)
Sample Web Searches...
Web pages related to Light Combat Aircraft
  Keywords are unique
  Use search tools (e.g. www.altavista.com)
  Search for "Light Combat Aircraft" (phrase search in simple search interface)
  Use of double quotes will force the search engine to consider the set of keywords as a phrase
  Search can be limited to specific dates
  More refined search in advanced search interface: "Light Combat Aircraft" AND India
Sample Web Searches...
Web sources related to simulation or modeling of activated sludge process
  This is a concept search: search tools are better
  Using Altavista, the query may be submitted as (simulat* OR model*) AND "activated sludge process"
  Note use of * to cover word variations like simulated, simulate, models, etc.
  Note use of phrase form for "activated sludge process"
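To see what the truncation and phrase parts of this query select, the sketch below evaluates the same logic over a few made-up page titles using regular expressions. It imitates the meaning of the query only; it is not how Altavista processes it, and the function names are hypothetical.

```python
import re


def truncation_pattern(term):
    """Turn a term such as 'simulat*' into a whole-word regular expression.

    A trailing * matches any word ending; without it the word must match
    exactly.
    """
    stem = re.escape(term.rstrip("*"))
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(rf"\b{stem}{suffix}\b", re.IGNORECASE)


PHRASE = re.compile(r"\bactivated\s+sludge\s+process\b", re.IGNORECASE)


def matches_sludge_query(text):
    """Evaluate (simulat* OR model*) AND "activated sludge process"."""
    either = any(truncation_pattern(t).search(text)
                 for t in ("simulat*", "model*"))
    return either and PHRASE.search(text) is not None


# Hypothetical page titles.
for title in ("Modelling of the activated sludge process",
              "Simulated annealing for process control",
              "Activated sludge process design manual"):
    print(title, "->", matches_sludge_query(title))
```

Only the first title satisfies both halves of the query: the second lacks the phrase, and the third contains neither a simulat* nor a model* word.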
Guides to Search Tools
www.beaucoup.com (guide to 2,000+ search engines, indices and directories)
www.searchpower.com (a very comprehensive search engine directory; claims over 16,000 search engine listings!)
www.123go.com/drw/search/search.htm (Dr. Webster's Big Page of Search Engines)
www.finderseeker.com (The search engine of search engines)
www.virtualfreesites.com (Over 1,000 specialised search engines)
Keeping Current
AskScott (www.askscott.com): Provides a very comprehensive tutorial on search engines
SearchEngineWatch (www.searchenginewatch.com): Offers information about new developments in search engines and provides reviews and tutorials
Botspot (www.botspot.com): Collection of and guide to a variety of bots (intelligent agents)
raja@ncsi.iisc.ernet.in