Richard Sapon-White
     March 18, 2013
 • The growth of the Web
 • Metadata in the context of the Web
 • Important metadata schemes: XML, HTML, MARC
 • From 1996 to 2007:
    ◦ 77,138 Web sites → 125 million Web sites
 • Access provided through search engines
    ◦ Google is the most used search engine
 • Search engines use Web crawlers (a.k.a. spiders
   or robots) to collect information on web sites
    ◦ Copy web pages and locations to build a catalog of
      indexed pages
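The crawl-and-index step above can be sketched in a few lines of Python. The sample page, URL, and index layout here are invented for illustration; real crawlers add URL normalization, politeness rules, and persistent storage:

```python
from html.parser import HTMLParser

class PageIndexer(HTMLParser):
    """Collects outgoing links and visible words from one page,
    mimicking what a crawler records when it indexes a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []
        self._skip = False  # True while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        elif tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(w.lower() for w in data.split())

def index_page(url, html, inverted_index):
    """Add one page to a simple inverted index (word -> set of URLs)
    and return its outgoing links as the frontier to crawl next."""
    parser = PageIndexer()
    parser.feed(html)
    for word in parser.words:
        inverted_index.setdefault(word, set()).add(url)
    return parser.links

# Demo on a hardcoded page, so no network access is needed
index = {}
page = '<html><body><h1>Metadata basics</h1><a href="/dc.html">Dublin Core</a></body></html>'
frontier = index_page("http://example.org/", page, index)
```

Searching the resulting `index` for a word returns the set of pages that contain it, which is the essence of what a search engine's catalog provides.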
 • Invisible Web = Deep Web
 • Web crawlers cannot:
    ◦ submit queries to databases,
    ◦ parse file formats that they do not recognize,
    ◦ click buttons on Web forms, or
    ◦ log in to sites requiring authentication
 • Therefore, much of the information on the Web is
   invisible!
    ◦ How much is invisible?
    ◦ Thousands of times larger than the indexed/visible web!
•  Topic Databases — subject-specific aggregations of information, such as SEC corporate filings, medical
   databases, patent records, etc.
•  Internal site — searchable databases for the internal pages of large sites that are dynamically created,
   such as the knowledge base on the Microsoft site.
•  Publications — searchable databases for current and archived articles.
•  Shopping/Auction.
•  Classifieds.
•  Portals — broader sites that include more than one of these other categories in searchable databases.
•  Library — searchable internal holdings, mostly for university libraries.
•  Yellow and White Pages — people and business finders.
•  Calculators — while not strictly databases, many do include an internal data component for calculating
   results. Mortgage calculators, dictionary look-ups, and translators between languages are examples.
•  Jobs — job and resume postings.
•  Message or Chat.
•  General Search — searchable databases most often relevant to Internet search topics and information.
From: Michael K. Bergman, "The Deep Web: Surfacing Hidden Value," Journal of Electronic Publishing 7, no.
   1 (August 2001). http://www.press.umich.edu/jep/07-01/bergman.html.
 • Poor site design results in invisible web sites
 • To create web sites for human and machine
   retrieval:
    ◦ Use hyperlinked hierarchies of categories
    ◦ Contribute Deep Web collections’ metadata to union
      catalogs (which can then be indexed by search engines)
 • Google’s Sitemaps protocol can provide a detailed
   list of the pages on a site
    ◦ http://www.sitemaps.org/
    ◦ http://en.wikipedia.org/wiki/Help:Contents/Site_map
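A minimal sitemap in the sitemaps.org format might look like the fragment below; the URL and date are placeholders, and the optional elements can be omitted:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2013-03-18</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

A site publishes this file (typically at its root) so that crawlers can find pages they could not reach by following links alone.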
 • Create conventional, MARC-based metadata
 • Access via library catalogs, union catalogs
 • Problems:
    ◦ Creating MARC records is labor-intensive, slow, and
      expensive
    ◦ Web sites are dynamic (content, URLs), requiring MARC
      records to be revised
 • Solutions:
    ◦ Dublin Core
    ◦ META tags
    ◦ Resource Description Framework (RDF)
 • Dublin Core PowerPoint
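As a sketch of the Dublin Core solution, DCMI's convention for embedding DC elements in an HTML page uses meta elements with "DC."-prefixed names; the element values below are drawn from this presentation itself for illustration:

```html
<head>
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
  <meta name="DC.title" content="Metadata and the Web">
  <meta name="DC.creator" content="Richard Sapon-White">
  <meta name="DC.date" content="2013-03-18">
  <meta name="DC.format" content="text/html">
</head>
```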
 • Embed two metadata elements in the HTML <head>
   section of a web page:
    ◦ Keywords
    ◦ Description
 • Example:
    ◦ <META NAME="KEYWORDS" CONTENT="data standards,
      metadata, Web resources, World Wide Web, cultural
      heritage information, digital resources, Dublin Core,
      RDF, Semantic Web">
      <META NAME="DESCRIPTION" CONTENT="Version 3.0 of the
      site devoted to metadata: what it is, its types and
      uses, and how it can improve access to Web resources;
      includes a crosswalk.">
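Software on the receiving end can read these tags with a few lines of Python's standard `html.parser`; the sample document below is invented, but the parsing pattern is general:

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collects name/content pairs from <meta> tags in a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_startendtag(self, tag, attrs):
        # Treat self-closing <meta ... /> the same as <meta ...>
        self.handle_starttag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)  # attribute names arrive lowercased
            name = d.get("name")
            if name:
                self.meta[name.lower()] = d.get("content", "")

doc = """<html><head>
<meta name="KEYWORDS" content="metadata, Dublin Core, RDF">
<meta name="DESCRIPTION" content="A site devoted to metadata.">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(doc)
```

After `feed()`, `reader.meta` maps each (lowercased) name to its content string, which is essentially what early search engines extracted from these tags.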
Metadata and the web