Richard Sapon-White
     March 18, 2013
 • The growth of the Web
 • Metadata in the context of the Web
 • Important metadata schemes: XML, HTML, MARC
 • From 1996 to 2007:
    ◦ 77,138 Web sites → 125 million Web sites
 • Access provided through search engines
    ◦ Google is the most used search engine
 • Search engines use Web crawlers (a.k.a. spiders
   or robots) to collect information on web sites
    ◦ Copy web pages and locations to build a catalog of
      indexed pages
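The crawl-and-index step above can be sketched in a few lines of Python. The sample page, URL, and index layout here are invented for illustration; real crawlers add URL normalization, politeness rules, and persistent storage:

```python
from html.parser import HTMLParser

class PageIndexer(HTMLParser):
    """Collects outgoing links and visible words from one page,
    mimicking what a crawler records when it indexes a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []
        self._skip = False  # True while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        elif tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(w.lower() for w in data.split())

def index_page(url, html, inverted_index):
    """Add one page to a simple inverted index (word -> set of URLs)
    and return its outgoing links as the frontier to crawl next."""
    parser = PageIndexer()
    parser.feed(html)
    for word in parser.words:
        inverted_index.setdefault(word, set()).add(url)
    return parser.links

# Demo on a hardcoded page, so no network access is needed
index = {}
page = '<html><body><h1>Metadata basics</h1><a href="/dc.html">Dublin Core</a></body></html>'
frontier = index_page("http://example.org/", page, index)
```

Searching the resulting `index` for a word returns the set of pages that contain it, which is the essence of what a search engine's catalog provides.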
 • Invisible Web = Deep Web
 • Web crawlers cannot:
    ◦ submit queries to databases,
    ◦ parse file formats that they do not recognize,
    ◦ click buttons on Web forms, or
    ◦ log in to sites requiring authentication
 • Therefore, much of the information on the Web is
   invisible!
    ◦ How much is invisible?
    ◦ Thousands of times larger than the indexed/visible web!
•  Topic Databases — subject-specific aggregations of information, such as SEC corporate filings, medical
   databases, patent records, etc.
•  Internal site — searchable databases for the internal pages of large sites that are dynamically created,
   such as the knowledge base on the Microsoft site.
•  Publications — searchable databases for current and archived articles.
•  Shopping/Auction.
•  Classifieds.
•  Portals — broader sites that include more than one of these other categories in searchable databases.
•  Library — searchable internal holdings, mostly for university libraries.
•  Yellow and White Pages — people and business finders.
•  Calculators — while not strictly databases, many do include an internal data component for calculating
   results. Mortgage calculators, dictionary look-ups, and translators between languages are examples.
•  Jobs — job and resume postings.
•  Message or Chat.
•  General Search — searchable databases most often relevant to Internet search topics and information.
From: Michael K. Bergman, "The Deep Web: Surfacing Hidden Value," Journal of Electronic Publishing 7, no.
   1 (August 2001). http://www.press.umich.edu/jep/07-01/bergman.html.
 • Poor site design results in invisible web sites
 • To create web sites for human and machine
   retrieval:
    ◦ Use hyperlinked hierarchies of categories
    ◦ Contribute Deep Web collections’ metadata to union
      catalogs (which can then be indexed by search engines)
 • Google’s Sitemaps protocol can provide a detailed
   list of the pages on a site
    ◦ http://www.sitemaps.org/
    ◦ http://en.wikipedia.org/wiki/Help:Contents/Site_map
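A minimal sitemap in the sitemaps.org format might look like the fragment below; the URL and date are placeholders, and the optional elements can be omitted:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2013-03-18</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

A site publishes this file (typically at its root) so that crawlers can find pages they could not reach by following links alone.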
 • Create conventional, MARC-based metadata
 • Access via library catalogs, union catalogs
 • Problems:
    ◦ Creating MARC records is labor-intensive, slow, and
      expensive
    ◦ Web sites are dynamic (content, URLs), requiring MARC
      records to be revised
 • Solutions:
    ◦ Dublin Core
    ◦ META tags
    ◦ Resource Description Framework (RDF)
 • Dublin Core PowerPoint
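As a sketch of the Dublin Core solution, DCMI's convention for embedding DC elements in an HTML page uses meta elements with "DC."-prefixed names; the element values below are drawn from this presentation itself for illustration:

```html
<head>
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
  <meta name="DC.title" content="Metadata and the Web">
  <meta name="DC.creator" content="Richard Sapon-White">
  <meta name="DC.date" content="2013-03-18">
  <meta name="DC.format" content="text/html">
</head>
```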
 • Embed two metadata elements in the HTML <head>
   section of a web page:
    ◦ Keywords
    ◦ Description
 • Example:
    ◦ <META NAME="KEYWORDS" CONTENT="data standards,
      metadata, Web resources, World Wide Web, cultural
      heritage information, digital resources, Dublin Core,
      RDF, Semantic Web">
      <META NAME="DESCRIPTION" CONTENT="Version 3.0 of the
      site devoted to metadata: what it is, its types and
      uses, and how it can improve access to Web resources;
      includes a crosswalk.">
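Software on the receiving end can read these tags with a few lines of Python's standard `html.parser`; the sample document below is invented, but the parsing pattern is general:

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collects name/content pairs from <meta> tags in a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_startendtag(self, tag, attrs):
        # Treat self-closing <meta ... /> the same as <meta ...>
        self.handle_starttag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)  # attribute names arrive lowercased
            name = d.get("name")
            if name:
                self.meta[name.lower()] = d.get("content", "")

doc = """<html><head>
<meta name="KEYWORDS" content="metadata, Dublin Core, RDF">
<meta name="DESCRIPTION" content="A site devoted to metadata.">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(doc)
```

After `feed()`, `reader.meta` maps each (lowercased) name to its content string, which is essentially what early search engines extracted from these tags.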
Metadata and the web