Big Data and Automated Content Analysis
Week 6 – Monday
»Web scraping«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
2 May 2016
Today
1 We put on our magic cap, pretend we are Firefox, scrape all
comments from GeenStijl, clean up the mess, and put the
comments in a neat CSV table
2 OK, but this surely can be done more elegantly? Yes!
3 Next steps
We put on our magic cap, pretend we are Firefox, scrape all
comments from GeenStijl, clean up the mess, and put the
comments in a neat CSV table
Let’s make a plan!
Which elements from the page do we need?
• What do they mean?
• How are they represented in the source code?
What should our output look like?
• What lists do we want?
• . . .
And how can we achieve this?
Operation Magic Cap
1 Download the page
• They might block us, so let’s pretend we are a web browser!
2 Remove all line breaks (\n, but maybe also \n\r or \r) and TABs (\t): we want one long string
3 Isolate the comment section (it starts with <div class="commentlist"> and ends with </div>)
4 Within the comment section, identify each comment (<article>)
5 Within each comment, separate the text (<p>) from the metadata (<footer>)
6 Put text and metadata in lists and save them to a CSV file
And how can we achieve this?
from urllib import request
import re
import csv

onlycommentslist = []
metalist = []

# pretend to be a web browser by sending a browser-like User-Agent header
req = request.Request("http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html", headers={"User-Agent": "Mozilla/5.0"})
tekst = request.urlopen(req).read()
# decode the bytes and replace line breaks and TABs by spaces: one long string
tekst = tekst.decode(encoding="utf-8", errors="ignore").replace("\n", " ").replace("\t", " ")

# isolate the comment section
commentsection = re.findall(r'<div class="commentlist">.*?</div>', tekst)
print(commentsection)
# within the comment section, identify each comment
comments = re.findall(r'<article.*?>(.*?)</article>', commentsection[0])
print(comments)
print("There are", len(comments), "comments")
# within each comment, separate the text (<p>) from the metadata (<footer>)
for co in comments:
    metalist.append(re.findall(r'<footer>(.*?)</footer>', co))
    onlycommentslist.append(re.findall(r'<p>(.*?)</p>', co))
# save text and metadata to a CSV file
writer = csv.writer(open("geenstijlcomments.csv", mode="w", encoding="utf-8"))
output = zip(onlycommentslist, metalist)
writer.writerows(output)
Some remarks
The regexp
• .*? instead of .* means lazy matching. Because .* is greedy and matches as much text as possible, the regexp would not stop where it should – we would get the whole rest of the document (or of the line, but we removed all line breaks).
• The parentheses in (.*?) make sure that the function only returns what is between them, not the surrounding parts (like <footer> and </footer>)
Optimization
• Only save the 0th (and only) element of the list
• Separate the username and interpret date and time
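A minimal sketch of these two optimizations, reusing comments and re from the script above. The metadata format "username | 01-05-14 | 22:15" is a made-up assumption for illustration – check the real source code:

from datetime import datetime

rows = []
for co in comments:
    # save the 0th (and only) element instead of a one-element list
    text = re.findall(r'<p>(.*?)</p>', co)[0]
    meta = re.findall(r'<footer>(.*?)</footer>', co)[0]
    # hypothetical footer format: "username | 01-05-14 | 22:15"
    username, date, time = [part.strip() for part in meta.split("|")]
    rows.append([text, username, datetime.strptime(date + " " + time, "%d-%m-%y %H:%M")])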
Further reading
Doing this with other sites?
• It’s basically puzzling with regular expressions.
• Look at the source code of the website to see how
well-structured it is.
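For a quick first look at a page’s source from within Python, a sketch (the URL is just a placeholder):

from urllib import request

src = request.urlopen("http://www.example.com").read().decode("utf-8", errors="ignore")
print(src[:500])   # eyeball the beginning of the source for structure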
OK, but this surely can be done more elegantly? Yes!
Scraping
Geenstijl-example
• Worked well (and we could do it with the knowledge we
already had)
• But we can also use existing parsers (that can interpret the structure of the HTML page)
• especially when the structure of the site is more complex
The following example is based on http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228. It uses the module lxml.
What do we need?
• the URL (of course)
• the XPATH of the element we want to scrape (you’ll see in a
minute what this is)
Playing around with the Firefox XPath Checker
Some things to play around with:
• // means ‘arbitrary depth’ (=may be nested in many higher
levels)
• * means ‘anything’ (p[2] is the second paragraph, p[*] are all paragraphs)
• If you want to refer to a specific attribute of an HTML tag, you can use @. For example, *[@id="reviews-container"] would grab a tag like <div id="reviews-container" class="user-content">
• Let the XPATH end with /text() to get all text
• Have a look at the source code of the web page to think of
other possible XPATHs!
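A self-contained sketch to try these operators out with lxml (the HTML snippet is invented for illustration):

from lxml import html

# toy document to try the operators on
doc = html.fromstring('''
<div id="reviews-container" class="user-content">
  <article><p>first</p><p>second</p></article>
  <article><p>third</p></article>
</div>''')

print(doc.xpath('//p/text()'))                # // = arbitrary depth: every <p> anywhere
print(doc.xpath('//article[1]/p[2]/text()'))  # p[2] = the second <p> of the first <article>
print(doc.xpath('//*[@id="reviews-container"]/article/p/text()'))  # * = any tag, selected via its @id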
The XPATH
You get something like
//*[@id="tabbedReviewsDiv"]/dl[1]/dd
//*[@id="tabbedReviewsDiv"]/dl[2]/dd
The * means “every”.
Also, to get the text of the element, the XPATH should end in /text().
We can infer that we (probably) get all comments with
//*[@id="tabbedReviewsDiv"]/dl[*]/dd/text()
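A minimal sketch applying this inferred XPATH with lxml (the Chicago Reader page layout may well have changed since, so treat this as illustration only):

from lxml import html
from urllib import request

req = request.Request("http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
tree = html.fromstring(request.urlopen(req).read().decode("utf-8", errors="ignore"))
comments = tree.xpath('//*[@id="tabbedReviewsDiv"]/dl[*]/dd/text()')
print(len(comments), "comments scraped")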
Let’s scrape them!
from lxml import html
from urllib import request

req = request.Request("http://www.kieskeurig.nl/tablet/samsung/galaxy_tab_3_101_wifi_16gb/reviews/1344691")
# parse the downloaded page into an element tree
tree = html.fromstring(request.urlopen(req).read().decode(encoding="utf-8", errors="ignore"))

# extract the review texts with the XPATH we identified
reviews = tree.xpath('//*[@id="reviews-container"]//*[@class="text margin-mobile-bottom-large"]/text()')

print(len(reviews), "reviews scraped. Showing the first 60 characters of each:")
for i, review in enumerate(reviews):
    print("Review", i, ":", review[:60])
The output – perfect!
34 reviews scraped. Showing the first 60 characters of each:
Review 0 : Ideaal in combinatie met onze Samsung curved tv en onze mobi
Review 1 : Gewoon een goed ding!!!!Ligt goed in de hand. Is duidelijk e
Review 2 : Prachtig mooi levendig beeld, hoever of kort bij zit maakt n
Review 3 : Opstartsnelheid is zeer snel.
Recap
General idea
1 Identify each element by its XPATH (look it up in your
browser)
2 Read the webpage into a (loooooong) string
3 Use the XPATH to extract the relevant text into a list (with a
module like lxml)
4 Do something with the list (preprocess, analyze, save)
Alternatives: scrapy, beautifulsoup, regular expressions, . . .
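As an illustration of one of these alternatives, a rough BeautifulSoup equivalent of the lxml example above (assuming the beautifulsoup4 module is installed; a CSS selector takes the place of the XPATH):

from urllib import request
from bs4 import BeautifulSoup

req = request.Request("http://www.kieskeurig.nl/tablet/samsung/galaxy_tab_3_101_wifi_16gb/reviews/1344691")
# parse the page and select elements by id and class
soup = BeautifulSoup(request.urlopen(req).read(), "html.parser")
reviews = [e.get_text() for e in soup.select('#reviews-container .text.margin-mobile-bottom-large')]
print(len(reviews), "reviews scraped")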
Last remarks
There is often more than one way to specify an XPATH
1 You can usually leave out the namespace (the x:)
2 Sometimes, you might want to use a different suggestion to be able to generalize better (e.g., using the attributes rather than the tags)
3 In that case, it makes sense to look deeper into the structure of the HTML code, for example with “Inspect Element”, and to use that information to play around with in the XPath Checker
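For instance, these two (hypothetical) expressions could address the same node; the attribute-based one tends to survive changes in the surrounding tags better:

# two ways to address the same (made-up) node with the tree from above
tree.xpath('//div/article/footer/text()')           # via the tag structure
tree.xpath('//*[@class="comment-meta"]/text()')     # via an attribute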
Next steps
From now on. . .
. . . focus on individual projects!
Wednesday
• Write a scraper for a website of your choice!
• Prepare a bit, so that you know where to start
• Suggestions: Review texts and grades from http://iens.nl,
comments from http://nujij.nl