Scraping Data from the Web using
Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017
About Me
● MSc. Informatics Student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
■ nithishr1
@nithishr
What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
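The extract-then-store idea can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: the HTML fragment, the class names, and the helper names `extract_rows` and `to_csv` are all assumptions made up for the example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for a downloaded web page.
HTML = """
<ul class="listing">
  <li><span class="name">Cafe Alpha</span> <a href="mailto:info@alpha.example">mail</a></li>
  <li><span class="name">Bakery Beta</span> <a href="mailto:hello@beta.example">mail</a></li>
</ul>
"""


def extract_rows(html):
    """Return one dict per listing entry found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find("ul", class_="listing").find_all("li"):
        name = item.find("span", class_="name").get_text(strip=True)
        # Strip the mailto: scheme so only the address is stored.
        email = item.find("a")["href"].replace("mailto:", "", 1)
        rows.append({"name": name, "email": email})
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text, one example of a structured format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(HTML)
print(to_csv(rows))
```

In a real crawl the HTML would come from a downloaded page rather than an inline string, and the selectors would depend on the target site's markup.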
Use Cases
Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
Tutorial on Web Scraping in Python
Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File present on the server specifying access limits to bots
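A spider can check robots.txt before fetching a URL. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the `is_allowed` helper are assumptions for illustration. In practice the file would be fetched from the target server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/listing/1"))     # True
print(is_allowed("https://example.com/private/data"))  # False
print(parser.crawl_delay("example-bot"))               # 5
```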
Pitfalls in Crawling
● JavaScript-heavy websites
○ Splash plugin
○ Selenium
● Default settings are not too friendly to website owners
○ Built-in AutoThrottle extension
● CAPTCHAs
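A friendlier crawl configuration might look like the fragment below. The setting names are Scrapy's own; the values and the user agent string are illustrative assumptions and should be tuned per site.

```python
# settings.py (fragment): values are illustrative assumptions, not recommendations.

# Respect the access limits published in robots.txt.
ROBOTSTXT_OBEY = True

# Let the AutoThrottle extension adapt the delay to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server

# Baseline politeness limits.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Identify the bot so site owners can get in touch (placeholder contact URL).
USER_AGENT = "example-bot (+https://example.com/contact)"
```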
Why Yellow Pages?
Email Marketing for Customer Acquisition
Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
○ Non-transparent
○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
○ Categorized into segments
○ Targeted emails
● Cheap
Connect
Nithish Raghunandanan
nithishr1
@nithishr
nithishr@gmail.com
www.ki-labs.com
Resources
● Scrapy Guide
○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
○ https://github.com/nithishr/meetup_scraping
