An Introduction to Web Scraping
with Python and DataCamp
Olga Scrivner, Research Scientist, CNS, CEWIT
WIM, February 23, 2018
Objectives
Materials: DataCamp.com
Review: Importing files
Accessing Web
Review: Processing text
Practice, practice, practice!
Credits
Hugo Bowne-Anderson - Importing Data in Python (Part 1 and Part 2)
Jeri Wieringa - Intro to Beautiful Soup
Importing Files
File Types: Text
Text files are structured as a sequence of lines
Each line includes a sequence of characters
Each line is terminated with a special character End of Line
Special Characters: Review
Special Characters: Answers
Modes
Reading Mode
◦ 'r'
Writing Mode
◦ 'w'
Quiz question: Why do we use quotes with 'r' and 'w'?
Answer: 'r' and 'w' are one-character strings
Open - Close
Open File - open(name, mode)
◦ name = 'filename'
◦ mode = 'r' or mode = 'w'
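A minimal sketch of the open and close steps above. The file name notes.txt and the temporary directory are assumptions made only so the example can run anywhere; they are not part of the original slides.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' opens the file for writing (and creates it if it does not exist).
out_file = open(path, "w")
out_file.write("first line\nsecond line\n")
out_file.close()  # close the file when you are done writing

# 'r' opens the same file for reading.
in_file = open(path, "r")
text = in_file.read()
in_file.close()

print(text)
```

Both 'r' and 'w' are passed as strings, which is why they need quotes.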
Open New File
Read File
(filename below is the file object returned by open())
Read the Entire File - filename.read()
Read ONE Line - filename.readline()
- The first call returns the FIRST line
- The third call returns the THIRD line
Read lines - filename.readlines()
What type of object and what is the length of this object?
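The three reading methods can be compared side by side. This is a sketch under assumptions: the file poem.txt and its three lines are made up for the demo, and the `with` statement (not shown on the slides) is used because it closes the file automatically.

```python
import os
import tempfile

# Hypothetical three-line file created just for this demo.
path = os.path.join(tempfile.mkdtemp(), "poem.txt")
with open(path, "w") as f:
    f.write("line one\nline two\nline three\n")

with open(path, "r") as f:
    whole = f.read()        # the entire file as one string

with open(path, "r") as f:
    first = f.readline()    # first call: the first line
    second = f.readline()   # second call: the second line
    third = f.readline()    # third call: the third line

with open(path, "r") as f:
    lines = f.readlines()   # every line, collected in a list

print(type(lines), len(lines))
```

Each line keeps its End of Line character, so `first` is "line one\n" rather than "line one".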
Python Libraries
Import Modules (Libraries)
Beautiful Soup
urllib
More in next slides ...
For installation - https://programminghistorian.org/lessons/intro-to-beautiful-soup
Review: Module I
To use external functions (modules), we need to import them:
1. Declare it at the top of the code
2. Use import
3. Call the module
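As a small illustration of the three steps above, here is a sketch that uses the standard-library module random; the choice of module and the range 1 to 6 are assumptions for the example.

```python
import random  # steps 1 and 2: declared at the top of the code with import

# Step 3: call a function through the module name.
roll = random.randint(1, 6)  # random integer from 1 to 6, both ends included
print(roll)
```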
Review: Modules II
To import a specific function from a module and refer to it directly:
1. Declare it at the top of the code
2. Use from ... import
3. Call the randint function from the random module by its own name:
randint()
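A sketch of the from ... import form, again with random as the example module. Once a function is imported this way it is called by its bare name, without the module prefix; the range 1 to 6 is an assumption for the demo.

```python
from random import randint  # import only the randint function

roll = randint(1, 6)  # called directly, no "random." prefix
print(roll)
```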
How to Import Packages with Modules
1. Install via a terminal or console
◦ Windows: type "command prompt" in the Windows search box
◦ Mac: type "terminal" in the Mac search
2. Check your Python version
3. Press Return/Enter
Python 2 (pip) or Python 3 (pip3)
pip or pip3 - a tool for installing Python packages
To check if pip is installed:
https://packaging.python.org/tutorials/installing-packages/
Web Scraping Workflow
Web Concept
1. Import the necessary modules (functions)
2. Specify URL
3. Send a REQUEST
4. Catch RESPONSE
5. Return HTML as a STRING
6. Close the RESPONSE
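The six steps above can be sketched with urllib.request. To keep the example runnable without a network connection, it writes a tiny HTML page to a temporary file and fetches it through a file:// URL; that local page and its content are assumptions for the demo. With a real site, only the value of url would change.

```python
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen  # 1. import the necessary functions

# Stand-in "web page": a tiny local HTML file so the sketch runs offline.
page_path = Path(tempfile.mkdtemp()) / "page.html"
page_path.write_text(
    "<html><head><title>Demo</title></head><body>Hi</body></html>",
    encoding="utf-8",
)

url = page_path.as_uri()                 # 2. specify the URL (file://... here)
request = Request(url)                   # 3. build the request ...
response = urlopen(request)              #    ... send it, and 4. catch the response
html = response.read().decode("utf-8")   # 5. return the HTML as a string
response.close()                         # 6. close the response

print(html)
```

response.read() returns bytes, so the decode step is what turns the page into a Python string.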
URLs
1. URL - Uniform (or Universal) Resource Locator
2. A URL for web addresses consists of two parts:
2.1 Protocol identifier - http: or https:
2.2 Resource name - datacamp.com
3. HTTP - HyperText Transfer Protocol
4. HTTPS - HTTP over an encrypted connection (more secure)
5. Going to a website = sending an HTTP request (GET request)
6. HTML - HyperText Markup Language
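The two parts of a URL can be inspected with the standard-library function urlparse. This is a small sketch; the address is assembled from the protocol and resource name given above.

```python
from urllib.parse import urlparse

parts = urlparse("https://datacamp.com")

print(parts.scheme)  # protocol identifier
print(parts.netloc)  # resource name (the host)
```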
URLLIB package
Provides an interface for getting data across the web. Instead of file
names we use URLs
Step 1 No installation needed: urllib ships with Python as part of the standard library
Step 2 Import the function urlretrieve (from urllib.request) - to RETRIEVE URLs
during the REQUEST
Step 3 Create a variable url and provide the url link
url = 'https:somepage'
Step 4 Save the retrieved document locally
Step 5 Read the file
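Steps 2 to 5 might look like the sketch below. Because the slide's URL is only a placeholder, the example creates a small local file and retrieves it through a file:// URL so it runs offline; the file names and the semicolon-separated content are assumptions for the demo.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # Step 2: import urlretrieve

work_dir = Path(tempfile.mkdtemp())

# Stand-in for a remote file, so the sketch runs without a network connection.
source = work_dir / "remote.csv"
source.write_text("name;score\nAda;90\nLin;85\n", encoding="utf-8")

url = source.as_uri()                    # Step 3: the url variable
local_file = work_dir / "local_copy.csv"
urlretrieve(url, str(local_file))        # Step 4: save the document locally

with open(local_file, "r", encoding="utf-8") as f:  # Step 5: read the file
    content = f.read()

print(content)
```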
Your Turn - DataCamp
DataCamp.com - create a free account using IU email
1. Log in
2. Select Groups
3. Select RBootcampIU - see Jennifer if you do not see it
4. Go to Assignments and select Importing Data in Python
Today’s Practice
Importing Flat Files
urlretrieve is called here with two arguments: url (input) and file name (output)
Example: urlretrieve(url, 'file.name')
Opening and Reading Files
read_csv is called with two arguments: url and sep (separator)
df.head() shows the first rows of the resulting DataFrame df
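A sketch of read_csv with a separator, followed by head(). To stay runnable offline, the example reads from an in-memory text buffer instead of a URL; the column names, values, and the ';' separator are assumptions for the demo. read_csv accepts a URL string in the same position.

```python
import io

import pandas as pd

# Stand-in for the text of a remote file; a URL string could be passed instead.
raw = io.StringIO("name;score\nAda;90\nLin;85\nSam;70\n")

df = pd.read_csv(raw, sep=";")  # sep tells pandas how the columns are separated
print(df.head())                # first rows of the DataFrame (5 by default)
```

Note that head() is a method of the DataFrame returned by read_csv, not of the pandas module itself.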
Importing Non-flat Files
read_excel is called with two arguments: url and sheet_name (spelled sheetname in older pandas versions)
To read all sheets, sheet_name = None
Let's use the sheet named '1700'
HTTP Requests
GET request
Import request package
HTTP with urllib
Print HTTP with urllib
Use response.read()
Return Web as a String
Use r.text, where r is the response object returned by requests.get(url)
Scraping Web - HTML
Scraping Web - BeautifulSoup Workflow
Many Useful Functions
soup.title
soup.get_text()
soup.find_all('a')
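The three functions listed above can be tried on a small HTML string. This is a sketch that assumes the beautifulsoup4 package is installed; the HTML snippet and the example.com links are made up for the demo, and "html.parser" is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, used instead of a downloaded page.
html = """<html><head><title>Demo page</title></head>
<body><p>Hello, <a href="https://example.com/a">first</a> and
<a href="https://example.com/b">second</a>.</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title        # the <title> tag
page_text = soup.get_text()   # all text with the tags stripped
a_tags = soup.find_all("a")   # a list of every <a> tag

print(title_tag.string)
print(len(a_tags))
```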
Parsing HTML with BeautifulSoup
Turning a Webpage into Data with BeautifulSoup
soup.title
soup.get_text()
Turning a Webpage into Data - Hyperlinks
HTML tag - <a>
find_all('a')
Collect all href: link.get('href')
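Collecting every href might look like the sketch below, assuming beautifulsoup4 is installed. The HTML snippet is made up for the demo and deliberately includes one <a> tag without an href attribute, for which link.get('href') returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical page content with three <a> tags; the last one has no href.
html = """<body>
<a href="https://example.com/a">External</a>
<a href="/about">About</a>
<a name="top">Anchor without a link</a>
</body>"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for link in soup.find_all("a"):
    hrefs.append(link.get("href"))  # None when the tag has no href attribute

print(hrefs)
```

Filtering out the None values before using the list is usually a sensible next step.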
