WEB SCRAPING
Dmytro Nekh
- Data scraping
- Types of data scraping
- Web scraping
- Process of web scraping
Data scraping
Data scraping is a technique in which a computer
program extracts data from human-readable output
coming from another program.
Types of data scraping
Screen scraping is the method of collecting screen display data from
one application and translating it so that another application is able to
display it.
Report mining is the extraction of data from human-readable
computer reports.
Web scraping is the technique of extracting data from the web and
turning unstructured web data (including HTML) into structured data
that you can store on your local computer or in a database.
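Report mining, as defined above, can be sketched in a few lines of Python. This is a minimal illustration, not a general tool: the report text, its three-column layout, and the names `REPORT`, `mine_report`, and `to_csv` are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical human-readable report, as another program might print it.
REPORT = """\
Monthly sales report
====================
Region      Units   Revenue
North         120   2400.00
South          85   1700.50
West          230   4600.25
--------------------
End of report
"""

def mine_report(text):
    """Return one dict per data row of the (assumed) three-column report."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # decoration lines, blank lines
        region, units, revenue = parts
        try:
            rows.append({"region": region,
                         "units": int(units),
                         "revenue": float(revenue)})
        except ValueError:
            continue  # three words but not numbers: title, header, footer
    return rows

def to_csv(rows):
    """Serialize the structured rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["region", "units", "revenue"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

records = mine_report(REPORT)
print(to_csv(records))
```

The same idea carries over to web scraping: unstructured, human-oriented text goes in, and records that can be stored in a file or database come out.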
Manual scraping: Copy-paste technique
Text Pattern Matching
This is a simple technique that matches text against patterns, using the
UNIX grep command or the regular expression facilities of popular
programming languages.
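    import re

    def isPhoneNumber(text):
        # Helper assumed by the slide; minimal version: exactly ddd-ddd-dddd
        return re.fullmatch(r'\d{3}-\d{3}-\d{4}', text) is not None
    message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
    for i in range(len(message)):
        chunk = message[i:i+12]  # a number in this format is 12 characters
        if isPhoneNumber(chunk):
            print('Phone number found: ' + chunk)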
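The sliding-window loop above can also be written as a single regular expression search. The sketch below uses Python's standard `re` module; the pattern assumes numbers written as ddd-ddd-dddd, as in the sample message, and the names `PHONE_RE` and `find_phone_numbers` are made up for the example.

```python
import re

# Assumed format: three digits, three digits, four digits, separated by
# hyphens. Real-world phone formats vary and would need a broader pattern.
PHONE_RE = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')

def find_phone_numbers(text):
    """Return every substring of text that matches PHONE_RE, in order."""
    return PHONE_RE.findall(text)

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for number in find_phone_numbers(message):
    print('Phone number found: ' + number)
```

The word boundaries (`\b`) keep the pattern from matching inside a longer run of digits.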
Computer vision web-page analysis
There are efforts using machine learning and
computer vision that attempt to identify and extract
information from web pages by interpreting pages
visually as a human being might.
Vertical Aggregation
Vertical aggregation platforms are created by companies with huge
computing power, targeting specific verticals. Some even run these
data harvesting platforms on the cloud. Creation and monitoring of bots
for specific verticals is done by these platforms, with virtually no human
intervention. Since the bots are created automatically based on the
knowledge base for the specific vertical, the efficiency of the bots is
measured by the quality of data extracted.
HTML Parsing
HTML parsing can be done using JavaScript, and targets linear or nested HTML pages. This fast and
robust method is used for text extraction, link extraction (for example, nested links or email
addresses), resource extraction, and so on.
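As a rough illustration of text and link extraction, here is a sketch using Python's standard `html.parser` module (Python is used instead of JavaScript only to match the earlier snippet). The sample page and the class name `LinkAndTextExtractor` are assumptions made up for the example.

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collects href values of <a> tags and visible text fragments."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

# Hypothetical page used only for this example.
SAMPLE_HTML = """
<html><body>
  <h1>Contacts</h1>
  <ul>
    <li><a href="https://example.com/team">Team page</a></li>
    <li><a href="mailto:info@example.com">Email us</a></li>
  </ul>
  <script>var ignored = true;</script>
</body></html>
"""

parser = LinkAndTextExtractor()
parser.feed(SAMPLE_HTML)
print(parser.links)
print(parser.text_parts)
```

The parser walks the page once, so nested lists and links are handled the same way as flat ones.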
DOM Parsing
The Document Object Model, or DOM, defines the style, structure, and
contents of XML and HTML documents. DOM parsers are generally used by
scrapers that want an in-depth view of the structure of a web page. One
can use a DOM parser to get the nodes containing information, and then
use a tool like XPath to scrape web pages.
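The node-then-XPath approach can be sketched with Python's standard `xml.etree.ElementTree`, which builds a tree of nodes and supports a limited subset of XPath. The sample page, its class names, and the function `extract_products` are assumptions made up for the example. The sketch also assumes well-formed markup; real pages often are not, and would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this example.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="product">
      <span class="name">Keyboard</span>
      <span class="price">49.90</span>
    </div>
    <div class="product">
      <span class="name">Mouse</span>
      <span class="price">19.50</span>
    </div>
    <div class="footer">Contact us</div>
  </body>
</html>
"""

def extract_products(page):
    """Return (name, price) tuples for every product node in the page."""
    root = ET.fromstring(page.strip())
    products = []
    # XPath-style query: every <div class="product"> anywhere in the tree.
    for node in root.findall(".//div[@class='product']"):
        name = node.find("span[@class='name']").text
        price = float(node.find("span[@class='price']").text)
        products.append((name, price))
    return products

print(extract_products(SAMPLE_PAGE))
```

Selecting by structure rather than by text pattern means the footer `div` is skipped without any special handling.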
Simple DOM Parser
Tools for web scraping
- Selenium
- Import.io
- PhantomJS
- Scrapy
- etc.