 
Overview
  • Do we need Semantic Web Crawlers?
  • Current Features
  • Crawler Architecture
  • Crawler Configuration
  • Applications and Future Extensions
Do We Need Semantic Web Crawlers?
  • Increasing availability of distributed data
  • Mirroring often only option for large sources
  • Varying application needs
  • Real-time retrieval not always necessary/desirable
  • Personal metadata increasingly distributed
  • Need a means to collate data
  • Compiling large, varied datasets for research
  • Triple store and query engine load testing
Introducing Slug
  • Open Source multi-threaded web crawler
  • Supports creation of crawler “profiles”
  • Highly extensible
  • Cache content in file system or database
  • Crawl new content, or “freshen” existing data
  • Generates RDF metadata for crawling activity
  • Hopefully(!) easy to use, and well documented
 
Crawler Architecture
  • Basic Java Framework
      • Multi-threaded retrieval of resources via HTTP
      • Could be used to support other protocols
      • Extensible via RDF configuration file
  • Simple Component Model
      • Content processing and task filtering components
      • Implement custom components for new behaviours
      • Number of built-in behaviours, e.g. crawl depth limiting, URL blacklisting, etc.
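The retrieval loop described above can be sketched roughly as follows. This is an illustration in Python rather than Slug's Java code, and the names `crawl`, `fetch`, `discover`, and `FAKE_WEB` are hypothetical. The round-by-round frontier is a simplification chosen for the sketch; the slides do not say how Slug schedules its worker threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for the network: each "URL" maps to the links it contains.
FAKE_WEB: Dict[str, List[str]] = {"a": ["b", "c"], "b": ["c"], "c": []}


def fake_fetch(url: str) -> List[str]:
    """Pretend to retrieve a document; a real crawler would issue an HTTP GET."""
    return FAKE_WEB[url]


def fake_discover(body: List[str]) -> Iterable[str]:
    """Pretend to extract links; here the body already is the list of links."""
    return body


def crawl(seeds: List[str],
          fetch: Callable[[str], object],
          discover: Callable[[object], Iterable[str]],
          workers: int = 4) -> Dict[str, object]:
    """Fetch the frontier on worker threads, then queue any newly found URLs."""
    seen = set(seeds)
    frontier = list(seeds)
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # All URLs in the current round are fetched concurrently.
            bodies = list(pool.map(fetch, frontier))
            next_frontier: List[str] = []
            for url, body in zip(frontier, bodies):
                results[url] = body
                for link in discover(body):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return results
```

Calling `crawl(["a"], fake_fetch, fake_discover)` visits every document reachable from the seed exactly once. Swapping `fake_fetch` for another callable is the kind of hook that would let the same loop serve protocols other than HTTP.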
 
Component Model
Consumers
  • Responsible for processing results of tasks
  • Support for multiple consumers per profile
  • RDFConsumer
      • Parses content; updates memory with triple count
      • Discovers rdfs:seeAlso links; submits new tasks
  • ResponseStorer
      • Stores retrieved content in file system
  • PersistentResponseStorer
      • Stores retrieved content in Jena persistent model
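To make the link-discovery step concrete, here is a minimal Python sketch of finding rdfs:seeAlso targets in an RDF/XML document. It is not the RDFConsumer implementation: it uses the standard library XML parser instead of a full RDF parser, handles only the `rdf:resource` attribute form, and does not resolve relative URIs. The function name `find_see_also` and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical sample document, used only to exercise the function.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/alice#me">
    <rdfs:seeAlso rdf:resource="http://example.org/bob.rdf"/>
    <rdfs:seeAlso rdf:resource="http://example.org/carol.rdf"/>
  </foaf:Person>
</rdf:RDF>"""


def find_see_also(rdf_xml: str) -> List[str]:
    """Return the rdf:resource targets of every rdfs:seeAlso element, in document order."""
    root = ET.fromstring(rdf_xml)
    links: List[str] = []
    for element in root.iter(f"{{{RDFS_NS}}}seeAlso"):
        target = element.get(f"{{{RDF_NS}}}resource")
        if target:
            links.append(target)
    return links
```

Each URL returned by `find_see_also(SAMPLE)` would become a candidate task, subject to whatever task filters the profile configures.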
Task Filters
  • Filters are applied before new Tasks are accepted
  • Support for multiple filters per profile
  • Task must pass all filters to be accepted
  • DepthFilter
      • Rejects tasks that are beyond a certain “depth”
  • RegexFilter
      • Rejects URLs that match a regular expression
  • SingleFetchFilter
      • Loop avoidance; removes previously encountered URLs
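The "must pass all filters" rule can be sketched as a chain of predicates. The Python below is an illustration only; `Task`, `depth_filter`, `regex_filter`, `single_fetch_filter`, and `accept_task` are hypothetical names, not Slug's Java classes. The depth rule follows the comment in the sample filter configuration (a task is rejected when its depth is greater than or equal to the limit, and the initial depth is 0).

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass(frozen=True)
class Task:
    url: str
    depth: int

TaskFilter = Callable[[Task], bool]  # returns True when the task is acceptable


def depth_filter(limit: int) -> TaskFilter:
    """Accept only tasks whose depth is below the limit."""
    return lambda task: task.depth < limit


def regex_filter(pattern: str) -> TaskFilter:
    """Reject tasks whose URL matches the pattern."""
    compiled = re.compile(pattern)
    return lambda task: compiled.search(task.url) is None


def single_fetch_filter() -> TaskFilter:
    """Reject URLs that have already been accepted once (loop avoidance)."""
    seen: Set[str] = set()

    def accept(task: Task) -> bool:
        if task.url in seen:
            return False
        seen.add(task.url)
        return True

    return accept


def accept_task(task: Task, filters: List[TaskFilter]) -> bool:
    """A task is accepted only if every filter accepts it."""
    return all(task_filter(task) for task_filter in filters)
```

Because `all` stops at the first rejection, the stateful single-fetch filter is best placed last in the list, so that a URL rejected for depth or pattern reasons is not recorded as already seen.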
 
Scutter Profile
  • A combination of configuration options
  • Uses custom RDFS Vocabulary
  • Current options:
      • Number of threads
      • Memory location
      • Memory type (persistent, file system)
      • Specific collection of Consumers and Filters
  • Custom components may have their own configuration
Example Profile
    <slug:Scutter rdf:about="default">
      <slug:hasMemory rdf:resource="memory"/>
      <!-- consumers for incoming data -->
      <slug:consumers>
        <rdf:Seq>
          <rdf:li rdf:resource="storer"/>
          <rdf:li rdf:resource="rdf-consumer"/>
        </rdf:Seq>
      </slug:consumers>
    </slug:Scutter>
Example Consumer
    <slug:Consumer rdf:about="rdf-consumer">
      <dc:title>RDFConsumer</dc:title>
      <dc:description>Discovers seeAlso links in RDF models and adds them to task list</dc:description>
      <slug:impl>com.ldodds.slug.http.RDFConsumer</slug:impl>
    </slug:Consumer>
Sample Filter
    <slug:Filter rdf:about="depth-filter">
      <dc:title>Limit Depth of Crawling</dc:title>
      <slug:impl>com.ldodds.slug.http.DepthFilter</slug:impl>
      <!-- if depth >= this then url not included. Initial depth is 0 -->
      <slug:depth>3</slug:depth>
    </slug:Filter>
Sample Memory Configuration
    <slug:Memory rdf:about="db-memory">
      <slug:modelURI rdf:resource="http://www.example.com/test-model"/>
      <slug:dbURL>jdbc:mysql://localhost/DB</slug:dbURL>
      <slug:user>USER</slug:user>
      <slug:pass>PASSWORD</slug:pass>
      <slug:dbName>MySQL</slug:dbName>
      <slug:driver>com.mysql.jdbc.Driver</slug:driver>
    </slug:Memory>
 
Scutter Vocabulary
  • Vocabulary for crawl-related metadata
      • Where have I been?
      • What responses did I get?
      • Where did I find a reference to this document?
  • Draft Specification by Morten Frederiksen
  • Crawler automatically generates history
  • Components can store additional metadata
Scutter Vocab Overview
  • Representation
      • “Shadow resource” of a source document
      • scutter:source = URI of source document
      • scutter:origin = URIs which reference source
      • Related to zero or more Fetches (scutter:fetch)
      • scutter:latestFetch = most recent Fetch
      • May be skipped because of previous error (scutter:skip)
Scutter Vocab Overview
  • Fetch
      • Describes a GET of a source document
      • HTTP Headers and Status
      • dc:date
      • scutter:rawTripleCount, included if parsed
      • May have caused a scutter:error and a Reason
  • Reason
      • Why was there an error?
      • Why is a Representation being skipped?
List Crawl History for Specific Representation
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX scutter: <http://purl.org/net/scutter/>

    SELECT ?date ?status ?contentType ?rawTripleCount
    WHERE {
      ?representation scutter:fetch ?fetch;
                      scutter:source <http://www.ldodds.com/ldodds.rdf>.
      ?fetch dc:date ?date.
      OPTIONAL { ?fetch scutter:status ?status. }
      OPTIONAL { ?fetch scutter:contentType ?contentType. }
      OPTIONAL { ?fetch scutter:rawTripleCount ?rawTripleCount. }
    }
    ORDER BY DESC(?date)
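For readers less familiar with SPARQL, the shape of the data this query walks can be sketched with plain Python objects. This is a hypothetical in-memory model, not the Scutter vocabulary itself or Slug's storage layer; the class and field names simply mirror the properties named on the slides, and the sketch assumes dates are ISO 8601 strings in a single timezone so that string order matches time order.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Fetch:
    date: str                               # mirrors dc:date (ISO 8601 string assumed)
    status: Optional[int] = None            # mirrors scutter:status, may be absent
    content_type: Optional[str] = None      # mirrors scutter:contentType, may be absent
    raw_triple_count: Optional[int] = None  # mirrors scutter:rawTripleCount, set only if parsed

@dataclass
class Representation:
    source: str                                        # mirrors scutter:source
    origins: List[str] = field(default_factory=list)   # mirrors scutter:origin
    fetches: List[Fetch] = field(default_factory=list) # mirrors scutter:fetch

    def history(self) -> List[Fetch]:
        """Fetches from newest to oldest, like ORDER BY DESC(?date)."""
        return sorted(self.fetches, key=lambda fetch: fetch.date, reverse=True)

    def latest_fetch(self) -> Optional[Fetch]:
        """The most recent fetch, or None if the document was never fetched."""
        ordered = self.history()
        return ordered[0] if ordered else None
```

The optional fields play the same role as the OPTIONAL clauses in the query: a fetch that failed before parsing still appears in the history, just without a triple count.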
 
Working with Slug
  • Traditional Crawling Activities
      • E.g. adding data to a local database
  • Maintaining a local cache of useful data
      • E.g. crawl data using file system cache
      • ...and maintain with “-freshen”
      • Code for generating LocationMapper configuration
  • Mapping the Semantic Web?
      • Crawl history contains document relationships
      • No need to keep content, just crawl...
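The "mapping" idea above relies only on the source and origin relationships recorded in the crawl history. As a rough Python sketch, assuming the history has already been exported as (source, origin) pairs, a link graph could be built like this. The function name `link_graph` and the pair format are assumptions for the example, not part of Slug.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def link_graph(history: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each origin document to the documents it references.

    Each history entry is a (source, origin) pair: `origin` is a document
    in which a reference to `source` was found.
    """
    graph: Dict[str, List[str]] = defaultdict(list)
    for source, origin in history:
        if source not in graph[origin]:  # ignore repeated sightings of the same link
            graph[origin].append(source)
    return dict(graph)
```

No document content is needed to build this structure, which is the point made on the slide: the crawl metadata alone describes how documents link to one another.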
Future Enhancements
  • Support the Robot Exclusion Protocol
  • Allow configuration of the User-Agent header
  • Implement throttling on a global and per-domain basis
  • Check additional HTTP status codes to "skip" more errors
  • Support white-listing of URLs
  • Expose and capture more statistics while in progress
Future Enhancements
  • Support Content Negotiation to negotiate data
  • Allow pre-processing of data (GRDDL)
  • Follow more than just rdfs:seeAlso links
      • Allow configurable link discovery
  • Integrate a “smushing” utility
  • Better manage persistent data
  • Anything else?!
Questions?
  • http://www.ldodds.com/projects/slug
  • [email_address]
Attribution and Licence
  • The following images were used in these slides:
      • http://flickr.com/photos/enygmatic/39266262/
      • http://www.flickr.com/photos/jinglejammer/601987
      • http://www.flickr.com/photos/sandyplotnikoff/105067900
  • Thanks to the authors!
  • Licence for this presentation: Creative Commons Attribution-ShareAlike 2.5

Slug: A Semantic Web Crawler
