Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa   [email_address] Six Apart, Ltd. / Shibuya Perl Mongers YAPC::Europe 2007 Vienna
Tatsuhiko Miyagawa
CPAN: MIYAGAWA
abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk 
Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal
 
http://code.sixapart.com/
 
Practical  Web Scraping with Web::Scraper
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
"Screen-scraping is so 1999!"
 
 
RSS is metadata, not a complete HTML replacement
Practical  Web Scraping with Web::Scraper
What's wrong with LWP & Regexp?
 
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
It works!
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
It works …
There are 3 problems (at least)
(1) Fragile: easy to break with even slight HTML changes (newlines, order of attributes, etc.)
(2) Hard to maintain: regular-expression-based scrapers are fine only when they're used in write-only scripts
(3) Improper  HTML & encoding handling
<span class="message">I &hearts; Vienna</span>

> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I &hearts; Vienna

<span class="message">I &hearts; Vienna</span>

> perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
I ♥ Vienna

<span class="message">ウィーンが大好き!</span>

> perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き!
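The "Wide character in print" warning appears when a decoded character string is printed to a filehandle that has no encoding layer. A minimal sketch of one common way to avoid it, using only the core Encode module: decode bytes once on input, work with characters, and encode once on output. The byte string below is an illustrative stand-in for an HTTP response body, not data from the talk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes as they might arrive over HTTP: the katakana word for "Vienna".
# (Illustrative input, assumed to be UTF-8.)
my $bytes = "\xE3\x82\xA6\xE3\x82\xA3\xE3\x83\xBC\xE3\x83\xB3";

# Decode once on input: bytes become a Perl character string.
my $text = decode('UTF-8', $bytes);

# Inside the program, length and regexes now operate on characters.
my $char_count = length $text;

# Encode once on output. Printing $text directly to a handle without an
# encoding layer is what triggers "Wide character in print".
my $out = encode('UTF-8', $text);
print $out, "\n";
```

An alternative with the same effect is to set an output layer once, for example `binmode STDOUT, ':encoding(UTF-8)'`, and then print the character string directly.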
The "right" way of screen-scraping
(1), (2): Maintainable, less fragile
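To make the fragility concrete, here is a small sketch that runs the regular expression from the one-liner against a few variants of the same markup. The variants are hypothetical, written for illustration rather than taken from the real site; each is equivalent to a browser but breaks the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the one-liner: it hard-codes the attribute list, the
# quoting style, and (without /s) assumes the text sits on a single line.
my $pattern = qr{<strong id="ctu">(.*?)</strong>};

# Hypothetical variants of the same element.
my %variants = (
    original        => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    extra_attribute => '<strong class="big" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    single_quotes   => "<strong id='ctu'>Monday, August 27, 2007 at 12:49:46</strong>",
    newline_in_text => "<strong id=\"ctu\">Monday, August 27, 2007\n at 12:49:46</strong>",
);

# Record which variants the pattern still matches.
my %matched;
for my $name (sort keys %variants) {
    $matched{$name} = $variants{$name} =~ $pattern ? 1 : 0;
    printf "%-16s %s\n", $name, $matched{$name} ? "matched" : "NOT matched";
}
```

Only the original form matches. A selector such as strong#ctu, evaluated against a parsed tree, does not depend on attribute order, quoting, or line breaks.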
Use XPath and CSS Selectors
XPath: HTML::TreeBuilder::XPath, XML::LibXML

XPath

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
CSS Selectors: "XPath for HTML coders", "XPath for people who hate XML"

CSS Selectors

    body { font-size: 12px; }
    div.article { padding: 1em }
    span#count { color: #fff }

XPath: //strong[@id="ctu"]
CSS Selector: strong#ctu

CSS Selectors

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath "strong#ctu";
    print $tree->findnodes($xpath)->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
Complete Script

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    # Fetch the page and stop on any HTTP error
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    # Decode the body using the encoding reported by the response
    my $content = decode $res->encoding, $res->content;

    # Parse the HTML and locate the node via a CSS selector converted to XPath
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node  = $tree->findnodes($xpath)->shift;
    print $node->as_text;
Robust, Maintainable, and Sane character handling
Example (before)

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
Example (after): the Complete Script above, built on LWP::UserAgent, HTTP::Response::Encoding, HTML::TreeBuilder::XPath and HTML::Selector::XPath
but … long and boring
Practical Web Scraping with  Web::Scraper
Web scraping toolkit, inspired by scrapi.rb; DSL-ish
Example (before): the same Complete Script, built on LWP::UserAgent, HTTP::Response::Encoding, HTML::TreeBuilder::XPath and HTML::Selector::XPath
Example (after)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;

    # Extract the text of strong#ctu and return just that value
    my $s = scraper {
        process "strong#ctu", time => 'TEXT';
        result 'time';
    };

    my $uri = URI->new("http://timeanddate.com/worldclock/");
    print $s->scrape($uri);

Basics

    use Web::Scraper;

    my $s = scraper {
        # DSL goes here
    };
    my $res = $s->scrape($uri);
process

    process $selector, $key => $what, … ;

$selector: CSS selector or XPath (XPath expressions start with /)
$key: key for the result hash; append "[]" for looping
$what: one of
    '@attr'
    'TEXT'
    Web::Scraper
    sub { … }
    hash reference
Sample HTML used by the process examples that follow:

    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>

Attribute value, with a CSS selector:

    process "ul.sites > li > a", 'urls[]' => '@href';
    # { urls => [ … ] }

Text content, with an XPath expression:

    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';
    # { names => [ 'OpenGuides', … ] }

Nested scraper:

    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };

Code reference:

    process "ul.sites > li > a", 'sites[]' => sub {
        # $_ is HTML::Element
        +{ link => $_->attr('href'), name => $_->as_text };
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };

Hash reference (no semicolon inside the braces, since this is a hash, not a block):

    process "ul.sites > li > a", 'sites[]' => {
        link => '@href', name => 'TEXT',
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };
result

    result;        # get stash as hashref (default)
    result @keys;  # get stash as hashref containing @keys
    result $key;   # get value of stash $key

    my $s = scraper {
        process …;
        process …;
        result 'foo', 'bar';
    };
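The difference between the default return value and a single-key result can be sketched without any network access by scraping an inline string. This assumes Web::Scraper is installed and relies on its documented ability to scrape a reference to an HTML string; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Inline copy of the sample list, so nothing is fetched over HTTP.
my $html = <<'HTML';
<html><body>
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
</body></html>
HTML

# No result call: scrape returns the whole stash as a hash reference.
my $all = scraper {
    process "ul.sites > li > a", 'names[]' => 'TEXT';
};

# result with a single key: scrape returns only that key's value.
my $only_names = scraper {
    process "ul.sites > li > a", 'names[]' => 'TEXT';
    result 'names';
};

my $stash = $all->scrape(\$html);          # hash reference with a names key
my $names = $only_names->scrape(\$html);   # array reference of link texts

print join(", ", @{ $stash->{names} }), "\n";
print join(", ", @$names), "\n";
```

Both print statements emit the same list; the only difference is whether the caller has to reach into the hash for it.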
More Examples
 
Thumbnail URLs on Flickr set

    #!/usr/bin/perl
    use strict;
    use Data::Dumper;
    use Web::Scraper;
    use URI;

    my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";

    # Collect the src attribute of every thumbnail image in the set
    my $s = scraper {
        process "a.image_link img", "thumbs[]" => '@src';
    };
    warn Dumper $s->scrape( URI->new($url) );
 
    <span class="vcard">
      <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
        <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image"
             src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a>
    </span>
    <span class="vcard"> … </span>

Twitter Friends

    #!/usr/bin/perl
    use strict;
    use Web::Scraper;
    use URI;
    use Data::Dumper;

    my $url = "http://twitter.com/miyagawa";

    # Collect the title attribute (the friend's name) of each vcard link
    my $s = scraper {
        process "span.vcard a", "people[]" => '@title';
    };
    warn Dumper $s->scrape( URI->new($url) );

Twitter Friends (complex)

    #!/usr/bin/perl
    use strict;
    use Web::Scraper;
    use URI;
    use Data::Dumper;

    my $url = "http://twitter.com/miyagawa";

    # One hash per vcard: profile link, name and thumbnail URL
    my $s = scraper {
        process "span.vcard", "people[]" => scraper {
            process "a", link => '@href', name => '@title';
            process "img", thumb => '@src';
        };
    };
    warn Dumper $s->scrape( URI->new($url) );
Tools
> cpan Web::Scraper
comes with the 'scraper' CLI

> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
    links => [
        'http://example.org/',
        'http://example.net/',
    ],
};
scraper> y
---
links:
  - http://example.org/
  - http://example.net/

> scraper /path/to/foo.html
> GET http://example.com/ | scraper
TODO
Web::Scraper needs documentation
More examples to put in the eg/ directory
Integrate with WWW::Mechanize and Test::WWW::Declare
XPath auto-suggestion off of DOM + element:
    DOM + XPath => Element
    DOM + Element => XPath? (Template::Extract?)
Questions?
Thank you
http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper

Web::Scraper

  • 1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers YAPC::Europe 2007 Vienna
  • 4. abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk 
Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal
  • 5.  
  • 7.  
  • 8. Practical Web Scraping with Web::Scraper
  • 9. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 10. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 12.  
  • 13.  
  • 14. RSS is a metadata not a complete HTML replacement
  • 15. Practical Web Scraping with Web::Scraper
  • 16. What's wrong with LWP & Regexp?
  • 17.  
  • 18. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 19. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 25. There are 3 problems (at least)
  • 26. (1) Fragile Easy to break even with slight HTML changes (like newlines, order of attributes etc.)
  • 27. (2) Hard to maintain Regular expression based scrapers are good Only when they're used in write-only scripts
  • 28. (3) Improper HTML & encoding handling
  • 29. <span class=&quot;message&quot;>I &hearts; Vienna</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Vienna
  • 30. <span class=&quot;message&quot;>I &hearts; Vienna</span> > perl –MHTML::Entities –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities ($1)' I ♥ Vienna
  • 31. <span class=&quot;message&quot;> ウィーンが大好き! </span> > perl –MHTML::Entities –MEncode –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. ウィーンが大好き!
  • 32. The &quot;right&quot; way of screen-scraping
  • 33. (1), (2) Maintainable Less fragile
  • 34. Use XPath and CSS Selectors
  • 36. XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 37. CSS Selectors &quot;XPath for HTML coders&quot; &quot;XPath for people who hates XML&quot;
  • 38. CSS Selectors body { font-size: 12px; } div.article { padding: 1em } span#count { color: #fff }
  • 39. XPath: //strong[@id=&quot;ctu&quot;] CSS Selector: strong#ctu
  • 40. CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 41. Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 42. Robust, Maintainable, and Sane character handling
  • 43. Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 44. Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 45. but … long and boring
  • 46. Practical Web Scraping with Web::Scraper
  • 47. Web scraping toolkit inspired by scrapi.rb DSL-ish
  • 48. Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 49. Example (after) #!/usr/bin/perl use strict; use warnings; use Web::Scraper; use URI; my $s = scraper { process &quot;strong#ctu&quot;, time => 'TEXT'; result 'time'; }; my $uri = URI->new(&quot;http://timeanddate.com/worldclock/&quot;); print $s->scrape($uri);
  • 50. Basics use Web::Scraper; my $s = scraper { # DSL goes here }; my $res = $s->scrape($uri);
  • 51. process process $selector, $key => $what, … ;
  • 52. $selector: CSS Selector or XPath (start with /)
  • 53. $key: key for the result hash append &quot;[]&quot; for looping
  • 54. $what: '@attr' 'TEXT' Web::Scraper sub { … } Hash reference
  • 55. <ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
  • 56. process &quot;ul.sites > li > a&quot;, 'urls[]' => ' @href '; # { urls => [ … ] } <ul class=&quot;sites&quot;> <li><a href=&quot; http://vienna.openguides.org/ &quot;>OpenGuides</a></li> <li><a href=&quot; http://vienna.yapceurope.org/ &quot;>YAPC::Europe</a></li> </ul>
  • 57. process '//ul[@class=&quot;sites&quot;]/li/a', 'names[]' => ' TEXT '; # { names => [ 'OpenGuides', … ] } <ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;> OpenGuides </a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;> YAPC::Europe </a></li> </ul>
  • 58. process "ul.sites > li", 'sites[]' => scraper {
          process 'a', link => '@href', name => 'TEXT';
      };
      # { sites => [ { link => …, name => … },
      #              { link => …, name => … } ] };
      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  • 59. process "ul.sites > li > a", 'sites[]' => sub {
          # $_ is HTML::Element
          +{ link => $_->attr('href'), name => $_->as_text };
      };
      # { sites => [ { link => …, name => … },
      #              { link => …, name => … } ] };
      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  • 60. process "ul.sites > li > a", 'sites[]' => {
          link => '@href', name => 'TEXT',
      };
      # { sites => [ { link => …, name => … },
      #              { link => …, name => … } ] };
      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
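The last three slides build the same `sites` structure in three ways: a nested scraper, a callback, and a hash reference. The sketch below runs all three against one hypothetical inline list (placeholder links, not the Vienna pages from the slides) and dumps each result so they can be compared. Without a base URL, all three forms are expected to return the `href` values as plain strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order, so the dumps are comparable

# Hypothetical markup with placeholder links, for illustration only.
my $html = <<'HTML';
<ul class="sites">
  <li><a href="http://example.org/">Example Org</a></li>
  <li><a href="http://example.net/">Example Net</a></li>
</ul>
HTML

# 1. Nested scraper: runs once per <li>, looking inside it for <a>.
my $nested = scraper {
    process 'ul.sites > li', 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
};

# 2. Callback: $_ is the matched HTML::Element.
my $callback = scraper {
    process 'ul.sites > li > a', 'sites[]' => sub {
        +{ link => $_->attr('href'), name => $_->as_text };
    };
};

# 3. Hash reference: each value is itself a '@attr' or 'TEXT' rule.
my $hashref = scraper {
    process 'ul.sites > li > a', 'sites[]' => {
        link => '@href', name => 'TEXT',
    };
};

my %result = (
    nested   => $nested->scrape(\$html),
    callback => $callback->scrape(\$html),
    hashref  => $hashref->scrape(\$html),
);

for my $form (sort keys %result) {
    print "$form:\n", Dumper($result{$form});
}
```

The hash reference form is the shortest when every field is a plain attribute or text lookup. The callback is the escape hatch when a value needs computing from the element.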
  • 61. result
      result;        # get stash as hashref (default)
      result @keys;  # get stash as hashref containing @keys
      result $key;   # get value of stash $key
      my $s = scraper {
          process …;
          process …;
          result 'foo', 'bar';
      };
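The three forms of `result` listed above differ only in what `scrape` returns. A minimal sketch, assuming a hypothetical inline document and made-up keys `title`, `heading`, and `body`:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical inline document, for illustration only.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><h1>Heading</h1><p>Body text</p></body></html>';

# No result call: scrape() returns the whole stash as a hash reference.
my $whole = scraper {
    process 'title', title   => 'TEXT';
    process 'h1',    heading => 'TEXT';
    process 'p',     body    => 'TEXT';
};

# Several keys: a hash reference containing only those keys.
my $subset = scraper {
    process 'title', title   => 'TEXT';
    process 'h1',    heading => 'TEXT';
    process 'p',     body    => 'TEXT';
    result 'title', 'heading';
};

# One key: the bare value stored under that key.
my $single = scraper {
    process 'title', title => 'TEXT';
    result 'title';
};

my $r_whole  = $whole->scrape(\$html);
my $r_subset = $subset->scrape(\$html);
my $r_single = $single->scrape(\$html);

print "whole:  ", join(', ', sort keys %$r_whole),  "\n";
print "subset: ", join(', ', sort keys %$r_subset), "\n";
print "single: $r_single\n";
```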
  • 64. Thumbnail URLs on Flickr set
      #!/usr/bin/perl
      use strict;
      use Data::Dumper;
      use Web::Scraper;
      use URI;
      my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";
      my $s = scraper {
          process "a.image_link img", "thumbs[]" => '@src';
      };
      warn Dumper $s->scrape( URI->new($url) );
  • 66. <span class="vcard">
        <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
          <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image"
               src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a>
      </span>
      <span class="vcard"> … </span>
  • 67. Twitter Friends
      #!/usr/bin/perl
      use strict;
      use Web::Scraper;
      use URI;
      use Data::Dumper;
      my $url = "http://twitter.com/miyagawa";
      my $s = scraper {
          process "span.vcard a", "people[]" => '@title';
      };
      warn Dumper $s->scrape( URI->new($url) );
  • 68. Twitter Friends (complex)
      #!/usr/bin/perl
      use strict;
      use Web::Scraper;
      use URI;
      use Data::Dumper;
      my $url = "http://twitter.com/miyagawa";
      my $s = scraper {
          process "span.vcard", "people[]" => scraper {
              process "a", link => '@href', name => '@title';
              process "img", thumb => '@src';
          };
      };
      warn Dumper $s->scrape( URI->new($url) );
  • 69. Tools
  • 70. > cpan Web::Scraper
      comes with 'scraper' CLI
  • 71. > scraper http://example.com/
      scraper> process "a", "links[]" => '@href';
      scraper> d
      $VAR1 = {
          links => [
              'http://example.org/',
              'http://example.net/',
          ],
      };
      scraper> y
      ---
      links:
        - http://example.org/
        - http://example.net/
  • 72. > scraper /path/to/foo.html
      > GET http://example.com/ | scraper
  • 73. TODO
  • 75. More examples to put in eg/ directory
  • 76. integrate with WWW::Mechanize and Test::WWW::Declare
  • 77. XPath auto-suggestion off of DOM + element
      DOM + XPath => Element
      DOM + Element => XPath? (Template::Extract?)
  • 79. Thank you
      http://search.cpan.org/dist/Web-Scraper
      http://www.slideshare.net/miyagawa/webscraper