Practical Web Scraping with Web::Scraper
Tatsuhiko Miyagawa [email_address]
Six Apart, Ltd. / Shibuya Perl Mongers
Shibuya.pm Tech Talks #8
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.
http://en.wikipedia.org/wiki/Screen_scraping
"Screen-scraping is so 1999!"
RSS is metadata, not a complete HTML replacement
Practical  Web Scraping with Web::Scraper
What's wrong with LWP & Regexp?

    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46
It works!
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
It works …
There are 3 problems (at least)
(1) Fragile: easy to break even with slight HTML changes (newlines, order of attributes, etc.)
(2) Hard to maintain: regular-expression-based scrapers are good only when they're used in write-only scripts
(3) Improper HTML & encoding handling
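To make problem (1) concrete, here is a minimal sketch that runs the one-liner's pattern against a few markup variations a site could plausibly switch to. The helper name extract_time and the variant snippets are made up for illustration; only the pattern itself comes from the slides.

```perl
use strict;
use warnings;

# The pattern from the one-liner, wrapped in a helper (hypothetical name).
sub extract_time {
    my ($html) = @_;
    return $html =~ m@<strong id="ctu">(.*?)</strong>@ ? $1 : undef;
}

# Illustrative variants; none of these change what a browser would display.
my %variants = (
    original        => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    single_quotes   => q{<strong id='ctu'>Monday, August 27, 2007 at 12:49:46</strong>},
    extra_attribute => '<strong class="t" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    newline_in_text => qq{<strong id="ctu">Monday, August 27, 2007\n at 12:49:46</strong>},
);

for my $name (sort keys %variants) {
    my $time = extract_time($variants{$name});
    printf "%-16s %s\n", $name, defined $time ? "matched" : "no match";
}
```

Only the original markup matches; the other three are equivalent HTML, yet the pattern misses them because it depends on exact quoting, attribute position, and text on a single line.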
In each one-liner below, $c holds the HTML shown above it.

    <span class="message">I &hearts; Shibuya</span>

    > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
    I &hearts; Shibuya

    <span class="message">I &hearts; Shibuya</span>

    > perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
    I ♥ Shibuya

    <span class="message">Perl が大好き! </span>

    > perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
    Wide character in print at -e line 1.
    Perl が大好き!
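The "Wide character in print" warning appears because the text has been decoded into characters but STDOUT has no encoding layer. Below is a minimal sketch of the usual decode-on-input, encode-on-output discipline, using only the core Encode module. The sample bytes are assumed to be UTF-8; a real response would need its charset detected first.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build the bytes a server might send (assumption: the page is UTF-8).
my $message = "Perl \x{304C}\x{5927}\x{597D}\x{304D}!";   # "Perl が大好き!"
my $bytes   = encode('UTF-8', qq{<span class="message">$message</span>});

# 1. Decode bytes into characters at the input boundary.
my $html = decode('UTF-8', $bytes);

# 2. Work on characters, not bytes.
my ($text) = $html =~ m@<span class="message">(.*?)</span>@;

# 3. Encode again at the output boundary; this avoids the warning.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
print length($text), " characters, ", length(encode('UTF-8', $text)), " bytes\n";
```

The character count and the byte count differ, which is why mixing the two silently corrupts or double-encodes text.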
The "right" way of screen-scraping
(1), (2): maintainable, less fragile
Use XPath and CSS Selectors
XPath: HTML::TreeBuilder::XPath, XML::LibXML
XPath

    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
CSS Selectors: "XPath for HTML coders", "XPath for people who hate XML"

CSS Selectors

    body { font-size: 12px; }
    div.article { padding: 1em }
    span#count { color: #fff }

XPath:        //strong[@id="ctu"]
CSS Selector: strong#ctu
CSS Selectors

    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath "strong#ctu";
    print $tree->findnodes($xpath)->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
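A small sketch of why selector-based matching is less fragile than the regular expression: the same selector should find the element even when quoting or attribute order changes. The two HTML snippets and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Translate the CSS selector once; reuse the XPath for every document.
my $xpath = selector_to_xpath("strong#ctu");

# Illustrative variants: different quoting and an extra attribute.
my @documents = (
    '<p><strong id="ctu">12:49:46</strong></p>',
    q{<p><strong class="t" id='ctu'>12:49:46</strong></p>},
);

my @found;
for my $html (@documents) {
    my $tree   = HTML::TreeBuilder::XPath->new_from_content($html);
    my ($node) = $tree->findnodes($xpath);        # first match, if any
    my $text   = defined $node ? $node->as_text : "(no match)";
    push @found, $text;
    print "$text\n";
    $tree->delete;                                # free the parse tree
}
```

Because the parser normalizes the markup into a tree before matching, differences that are invisible to a browser are also invisible to the selector.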
Complete Script

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    my $content = decode $res->encoding, $res->content;
    my $tree    = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath   = selector_to_xpath("strong#ctu");
    my $node    = $tree->findnodes($xpath)->shift;
    print $node->as_text;
Robust, Maintainable, and Sane character handling
Example (before)

    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46

Example (after): the complete script above, using LWP::UserAgent, HTTP::Response::Encoding, HTML::TreeBuilder::XPath and HTML::Selector::XPath
but … long and boring
Practical Web Scraping with Web::Scraper
A web scraping toolkit inspired by scrapi.rb, with a DSL-ish interface
Example (before): the complete LWP::UserAgent / HTML::TreeBuilder::XPath script shown above
Example (after)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;

    my $s = scraper {
        process "strong#ctu", time => 'TEXT';
        result 'time';
    };

    my $uri = URI->new("http://timeanddate.com/worldclock/");
    print $s->scrape($uri);
Basics

    use Web::Scraper;

    my $s = scraper {
        # DSL goes here
    };

    my $res = $s->scrape($uri);
process

    process $selector, $key => $what, … ;

$selector: CSS selector or XPath (starts with /)
$key: key for the result hash; append "[]" for looping
$what: one of
    '@attr'
    'TEXT'
    'RAW'
    Web::Scraper
    sub { … }
    Hash reference

Sample HTML for the process examples that follow:

    <ul class="sites">
      <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
      <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process "ul.sites > li > a", 'urls[]' => '@href';
    # { urls => [ … ] }

    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';
    # { names => [ 'OpenGuides', … ] }

    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };

    process "ul.sites > li > a", 'sites[]' => sub {
        # $_ is HTML::Element
        +{ link => $_->attr('href'), name => $_->as_text };
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };
    process "ul.sites > li > a", 'sites[]' => {
        link => '@href', name => 'TEXT',
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };
result

    result;        # get stash as hashref (default)
    result @keys;  # get stash as hashref containing @keys
    result $key;   # get value of stash $key

    my $s = scraper {
        process …;
        process …;
        result 'foo', 'bar';
    };
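Putting process and result together, here is a minimal sketch that scrapes the sample list from an in-memory string instead of a URI, so it needs no network access. The variable names are illustrative, and the HTML is the sample used in the process slides.

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

# Outer scraper loops over each <li>; the nested scraper builds one hash per item.
my $sites = scraper {
    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
    result 'sites';   # return just the array instead of the whole stash
};

my $result = $sites->scrape($html);

for my $site (@$result) {
    print "$site->{name}: $site->{link}\n";
}
```

Because of result 'sites', scrape returns the array reference directly; without it the return value would be a hash reference with a sites key.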
Live Demo
Tools
    > cpan Web::Scraper

comes with the 'scraper' CLI

    > scraper http://example.com/
    scraper> process "a", "links[]" => '@href';
    scraper> d
    $VAR1 = {
        links => [
            'http://example.org/',
            'http://example.net/',
        ],
    };
    scraper> y
    ---
    links:
      - http://example.org/
      - http://example.net/

    > scraper /path/to/foo.html
    > GET http://example.com/ | scraper
Recent Updates
0.13: 'c' and 'c all', WARN in scraper
0.14: automatic absolute URI for link elements (a@href, img@src)
0.14 (cont.): 'RAW' and 'HTML'
0.15: $Web::Scraper::UserAgent, $scraper->user_agent
0.19: support for encoding detection with META tags
TODO
Web::Scraper Needs documentation
More examples to put in eg/ directory
Alternative API inspired by scRUBYt!
OO Backend API if you don't like the DSL
integrate with WWW::Mechanize and Test::WWW::Declare
XPath auto-suggestion off of DOM + element
    DOM + XPath => Element
    DOM + Element => XPath? (Template::Extract?)
generic XML support (e.g. RSS/Atom feeds)
extensible text filters: date, geo, hCards (microformats)
    <span class="entry-date">October 1st, 2007 17:13:31 +0900</span>
    process ".entry-date", date => 'TEXT:rfc822';
Summary
Web::Scraper inspired by scrapi
easy, fun, maintainable & less fragile
CSS selectors and XPath
Questions?
Thank you
http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper
