Web Scraper Shibuya.pm tech talk #8

Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers Shibuya.pm Tech Talks #8

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.
http://en.wikipedia.org/wiki/Screen_scraping

"Screen-scraping is so 1999!"

RSS is metadata, not a complete HTML replacement

What's wrong with LWP & Regexp?

<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>
    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46

There are 3 problems (at least)

(1) Fragile: easy to break with even slight HTML changes (newlines, attribute order, etc.)
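The fragility can be seen without any network access. The following is a minimal sketch using only core Perl; the three HTML variants and the shortened time string are made up for illustration and are not taken from the real page. It applies one fixed pattern to three equivalent renderings of the same element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern assumes one exact spelling of the opening tag.
my $re = qr{<strong id="ctu">(.*?)</strong>};

# Three renderings of the same element (hypothetical sample data).
my %variants = (
    original        => qq{<strong id="ctu">12:49:46</strong>},
    extra_attribute => qq{<strong class="t" id="ctu">12:49:46</strong>},
    line_break      => qq{<strong id="ctu">12:49:46\n</strong>},
);

# Record and report which variants the pattern still matches.
my %matched;
for my $name (sort keys %variants) {
    $matched{$name} = $variants{$name} =~ $re ? 1 : 0;
    printf "%-16s %s\n", $name, $matched{$name} ? "matched" : "MISSED";
}
```

Only the original spelling matches. An added attribute changes the literal tag text, and "." does not match a newline unless the /s modifier is given.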

(2) Hard to maintain: regular-expression-based scrapers are good only when they're used in write-only scripts

(3) Improper HTML & encoding handling

I &hearts; Shibuya
    > perl -e '$c =~ m@(.*?)@ and print $1'
    I &hearts; Shibuya

I &hearts; Shibuya
    > perl -MHTML::Entities -e '$c =~ m@(.*?)@ and print decode_entities($1)'
    I ♥ Shibuya

Perl が大好き！
    > perl -MHTML::Entities -MEncode -e '$c =~ m@(.*?)@ and print decode_entities(decode_utf8($1))'
    Wide character in print at -e line 1.
    Perl が大好き！
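The "Wide character in print" warning appears when decoded text is printed to a filehandle that has no encoding layer. A minimal sketch of the usual remedy, using only the core Encode module: decode once on input, then let the output handle encode. The byte string below is a hand-written UTF-8 encoding of the sample text and stands in for a response body.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Raw UTF-8 bytes, as they might arrive in an HTTP response body
# (hand-encoded here so the script does not depend on the source encoding).
my $bytes = "Perl \xe3\x81\x8c\xe5\xa4\xa7\xe5\xa5\xbd\xe3\x81\x8d\xef\xbc\x81";

# Decode once on input: from here on $text holds characters, not bytes.
my $text = decode_utf8($bytes);

# length() counts characters on decoded text and bytes on raw input.
my $chars  = length $text;
my $octets = length $bytes;

# Add an encoding layer so print never sees an unencoded wide character.
binmode STDOUT, ':encoding(UTF-8)';
print "$text ($chars characters, $octets bytes)\n";
```

With the encoding layer in place the text prints without the warning, and character-level operations such as length work on the decoded string.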

The "right" way of screen-scraping

(1), (2) Maintainable, less fragile

XPath: HTML::TreeBuilder::XPath, XML::LibXML

XPath
    <td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>
    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46

CSS Selectors: "XPath for HTML coders", "XPath for people who hate XML"

CSS Selectors
    body { font-size: 12px; }
    div.article { padding: 1em }
    span#count { color: #fff }

XPath: //strong[@id="ctu"]
CSS Selector: strong#ctu

CSS Selectors
    <td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath "strong#ctu";
    print $tree->findnodes($xpath)->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46

Complete Script
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $ua = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }
    my $content = decode $res->encoding, $res->content;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node = $tree->findnodes($xpath)->shift;
    print $node->as_text;

Robust, maintainable, and sane character handling

Example (before)
    <td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>
    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46

Example (after)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $ua = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }
    my $content = decode $res->encoding, $res->content;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node = $tree->findnodes($xpath)->shift;
    print $node->as_text;

Web::Scraper: a web scraping toolkit inspired by scrapi.rb, with a DSL-ish interface

Example (before)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $ua = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }
    my $content = decode $res->encoding, $res->content;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node = $tree->findnodes($xpath)->shift;
    print $node->as_text;

Example (after)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;
    my $s = scraper {
        process "strong#ctu", time => 'TEXT';
        result 'time';
    };
    my $uri = URI->new("http://timeanddate.com/worldclock/");
    print $s->scrape($uri);

Basics
    use Web::Scraper;
    my $s = scraper {
        # DSL goes here
    };
    my $res = $s->scrape($uri);

process
    process $selector, $key => $what, …;

$selector: CSS selector, or XPath (if it starts with /)

$key: key for the result hash; append "[]" for looping

$what: '@attr', 'TEXT', 'RAW', a Web::Scraper object, sub { … }, or a hash reference

<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>

process "ul.sites > li > a", 'urls[]' => '@href';
# { urls => [ … ] }
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>

process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';
# { names => [ 'OpenGuides', … ] }
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>

process "ul.sites > li", 'sites[]' => scraper {
    process 'a', link => '@href', name => 'TEXT';
};
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>

process "ul.sites > li > a", 'sites[]' => sub {
    # $_ is HTML::Element
    +{ link => $_->attr('href'), name => $_->as_text };
};
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>

process "ul.sites > li > a", 'sites[]' => {
    link => '@href', name => 'TEXT',
};
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>

result
    result;        # get stash as hashref (default)
    result @keys;  # get stash as hashref containing @keys
    result $key;   # get value of stash $key
    my $s = scraper {
        process …;
        process …;
        result 'foo', 'bar';
    };
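Putting process and result together, here is a small end-to-end sketch. It assumes Web::Scraper is installed and passes scrape() a reference to an in-memory HTML string instead of a URI, so it runs without network access; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Sample markup from the slides, held in memory.
my $html = <<'HTML';
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

# The outer scraper loops over each <li>; the nested scraper builds one hash per item.
my $sites = scraper {
    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
    result 'sites';    # return only the value stored under 'sites'
};

# A scalar reference is treated as HTML content rather than a location to fetch.
my $list = $sites->scrape(\$html);

printf "%s <%s>\n", $_->{name}, $_->{link} for @$list;
```

Because result names a single key, scrape returns the array reference directly rather than a hash containing it.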

> cpan Web::Scraper
comes with the 'scraper' CLI

    > scraper http://example.com/
    scraper> process "a", "links[]" => '@href';
    scraper> d
    $VAR1 = {
        links => [
            'http://example.org/',
            'http://example.net/',
        ],
    };
    scraper> y
    ---
    links:
      - http://example.org/
      - http://example.net/

    > scraper /path/to/foo.html
    > GET http://example.com/ | scraper

0.13 'c' and 'c all' WARN in scraper

0.14 automatic absolute URI for link elements (a@href, img@src)

0.15 $Web::Scraper::UserAgent $scraper->user_agent

0.19 support encoding detection w/ META tags

Web::Scraper Needs documentation

More examples to put in eg/ directory

Alternative API inspired by scRUBYt!

OO Backend API if you don't like the DSL

integrate with WWW::Mechanize and Test::WWW::Declare

XPath auto-suggestion off of DOM + element
    DOM + XPath => Element
    DOM + Element => XPath? (Template::Extract?)

generic XML support (e.g. RSS/Atom feeds)

extensible text filter: date, geo, hCards (microformats)
    October 1st, 2007 17:13:31 +0900
    process ".entry-date", date => 'TEXT:rfc822';

Web::Scraper inspired by scrapi

easy, fun, maintainable & less fragile

Thank you
http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper
