DRAFT VERSION v0.1 
First steps with Scrapy 
@Francisco Sousa
WHAT IS SCRAPY?
Scrapy is an open source and collaborative 
framework for extracting the data you 
need from websites. 
It’s made in Python!
Who is it for?
Scrapy is for everyone who wants to collect 
data from one or many websites.
“The advantage of scraping is that you can 
do it with virtually any web site - from 
weather forecasts to government 
spending, even if that site does not have 
an API for raw data access” 
Friedrich Lindenberg
Alternatives?
There are many alternatives, such as: 
• Lxml 
• Beautiful Soup 
• Mechanize 
• Newspaper
Advantages of Scrapy?
• It’s free 
• It’s cross-platform (Windows, 
Linux, Mac OS and BSD) 
• Fast and powerful
Disadvantages of 
Scrapy?
• It only works with Python 2.7+ 
• It has a steeper learning curve than 
some other alternatives 
• Installation differs according to 
the operating system
Let’s start!
First of all you will have to install it, so run: 
pip install scrapy 
or 
sudo pip install scrapy 
Note: this command installs Scrapy 
and its dependencies. 
On Windows you will also have to install pywin32
Create our first project
Before we start scraping information, 
we will create a Scrapy project. Go to the 
directory where you want to create the 
project and run the following command:
scrapy startproject demo
The command above will create the 
skeleton for your project, as you can see 
below:
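Since the original figure may not be available, the layout generated by scrapy startproject (for the Scrapy versions of that era) looks roughly like this:

```
demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```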
The files created are the core of our 
project, so it’s important that you 
understand the basics: 
• scrapy.cfg: the project configuration file 
• demo/: the project’s Python module; you’ll later import 
your code from here. 
• demo/items.py: the project’s items file. 
• demo/pipelines.py: the project’s pipelines file. 
• demo/settings.py: the project’s settings file. 
• demo/spiders/: a directory where you’ll later put your 
spiders.
Choose a website to 
scrape
After we have the skeleton of the project, 
the next logical step is to choose, among 
all the websites in the world, the one 
we want to get information from.
For this example I chose to scrape 
information from The Verge, 
an important website of technology 
news.
Because The Verge is a giant website, I 
decided that I will only try to get 
information from the latest reviews of The 
Verge. 
So we have to follow these steps: 
1 See what the URL for reviews is 
2 Define how many pages of reviews we want to get 
3 Define what information to scrape 
4 Create a spider
See what the URL for reviews is: 
http://www.theverge.com/reviews
Define how many pages of reviews we 
want to get. For simplicity we will 
scrape only the first 5 pages of The Verge: 
• http://www.theverge.com/reviews/1 
• http://www.theverge.com/reviews/2 
• http://www.theverge.com/reviews/3 
• http://www.theverge.com/reviews/4 
• http://www.theverge.com/reviews/5
Define what information 
you want to scrape:
1 Title of the article 
2 Number of comments 
3 Author of the article
Create the fields for the information that 
you want to scrape in Python
Create a spider
name: identifies the Spider. It must be 
unique! 
start_urls: a list of URLs where the 
Spider will begin to crawl from. 
parse: a method of the spider, which will 
be called with the downloaded 
Response object of each start 
URL.
How to run my spider?
This is the easy part: to run our spider we 
simply have to run the following command: 
scrapy runspider <spider_file.py> 
E.g.: scrapy runspider the_verge.py
How to store 
information of my spider 
on a file?
To store the information of our spider we 
have to execute the following command: 
scrapy runspider the_verge.py -o items.json
There are other formats, like CSV and XML: 
CSV: 
scrapy runspider the_verge.py -o items.csv 
XML: 
scrapy runspider the_verge.py -o items.xml
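With the JSON export, the output file will contain one object per scraped item, roughly like this (the values are made up for illustration):

```json
[
  {"title": "Some review title", "author": "Some Author", "comments": "12"},
  {"title": "Another review title", "author": "Another Author", "comments": "7"}
]
```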
Conclusion
In this presentation you learned the key 
concepts of Scrapy and how to create a simple 
spider. Now it's time to get your hands 
dirty and experiment with other things :D
Thanks!
Appendix
Bibliography 
http://datajournalismhandbook.org/1.0/en/getting_data_3.html 
https://pypi.python.org/pypi/Scrapy 
http://scrapy.org/ 
http://doc.scrapy.org/
Code available at: 
https://github.com/FranciscoSousaDeveloper/demo
Contact: 
pt.linkedin.com/pub/francisco-sousa/4a/921/6a3/ 
@Francisco Sousa
