XQuery: Querying the World (formerly known as Web Scraping)
Dennis Knochenwefel <dennis.knochenwefel@28msec.com>
Evolution: Web Scraping
PHP (2007)

$url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<table cellpadding="2" class="standard_table"');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $number = strip_tags($cells[0][0]);
        $name = strip_tags($cells[0][1]);
        $position = strip_tags($cells[0][2]);
        echo "{$position} - {$name} - Number {$number} <br>\n";
    }
}

source: http://www.bradino.com/php/screen-scraping/
PHP (June 2011)

$url="http://www.rtu.ac.in/results/reformat.php";
$post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);
$totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
$page=new DOMDocument();
$xpath=new DOMXPath($page);
$page->loadHTML($content);
$page->saveHTML();  // this shows the page contents
$total=$xpath->query($totalPath);
echo $total->length;    //shows 0
echo $total->item(0)->nodeValue;   //shows nothing!!

source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
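Absolute paths such as html/body/table[4]/tbody/tr[3]/td[4] are fragile. One common cause of an empty result (not necessarily the cause in the question quoted above) is that browsers insert tbody elements that the served markup does not contain. A minimal sketch of a more tolerant lookup, written in Python with only the standard library; the class TableCells, the helper cell, and the SAMPLE markup are made up for illustration, and nested tables are not handled:

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table and row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>; each entry is a list of rows
        self._cell = None  # text fragments of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            # <tbody> is never inspected, so its presence or absence is irrelevant
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def cell(html, table, row, col):
    """Return the text of one cell, using 1-based indexes as XPath does."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.tables[table - 1][row - 1][col - 1]


# Made-up markup standing in for a results page.
SAMPLE = """<html><body>
<table><tr><td>a</td></tr></table>
<table>
  <tr><th>Subject</th><th>Marks</th></tr>
  <tr><td>Maths</td><td>71</td></tr>
  <tr><td>Total</td><td>640</td></tr>
</table>
</body></html>"""

print(cell(SAMPLE, 2, 3, 2))
```

The same call returns the same cell whether or not the rows are wrapped in tbody, which is the property the absolute path lacks.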
XQuery
Real World Example
awesome site
awesome data
no API
Deal with sessions
Need to emulate setting options
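What "dealing with sessions" and "emulating setting options" mean in practice can be sketched with Python's standard library: keep a cookie jar across requests, and submit the site's options as a form POST, optionally with URL parameters. This is only an illustration under assumptions; the URL, the lang parameter, and the helper names make_session, form_request, and fetch are placeholders, and the field names are borrowed from the 2011 PHP snippet.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar): the opener stores cookies and sends them back."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def form_request(url, fields, params=None):
    """Build the POST request an HTML form submission would produce."""
    if params:
        # Options that the site expects in the query string
        url = url + "?" + urllib.parse.urlencode(params)
    body = urllib.parse.urlencode(fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(url, data=body, headers=headers)


def fetch(opener, request):
    """Send the request through the session and return the decoded body."""
    with opener.open(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Placeholder URL and parameter; nothing is sent until fetch() is called.
request = form_request(
    "http://example.com/results",
    {"rollnumber": "08epccs060", "button": "Submit"},
    params={"lang": "en"},
)
print(request.get_method(), request.full_url)
print(request.data)
```

A typical flow would call make_session() once, fetch the start page so the site can set its session cookie, and then pass the form request to fetch() with the same opener.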
Different Notions: Publisher <=> Consumer
JSON? XML? CSV! HTML! XLS! Zip! App, Website
Stateless REST API? JSON? XML? CSV! HTML! XLS! Zip! Session! App, Website
Stateless REST API? JSON? XML? CSV! HTML! XLS! Zip! Session! App, Website. Customize with URL Params, HTML Forms
CSV! HTML! XLS! Zip! HTML! Session! Session! App, Website. XQuery! HTML Forms, HTML Forms
Summary
Session handling
Forms!!
XQuery Web Data Processing
A browser can do it? XQuery can do it!
Result: http://www.unemployment.by/country


London XQuery Meetup: Querying the World (Web Scraping)

Editor's Notes

  • #4: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
  • #5: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
  • #6: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page