Acquiring Data
Data Science for Beginners, Session 3
Session 3: your 5-7 things
• Finding development data
• Data filetypes
• Using an API
• PDF scrapers
• Web scrapers
• Getting data ready for science
Finding development data
Data
• Data files (CSV, Excel, JSON, XML...)
• Databases (SQLite, MySQL, Oracle, PostgreSQL...)
• APIs
• Report tables (tables on websites, in pdf reports...)
• Text (reports and other documents…)
• Maps and GIS data (openstreetmap, shapefiles, NASA earth images...)
• Images (satellite images, drone footage, pictures, videos…)
Data Sources
• data warehouses and catalogues
• open government data
• NGO websites
• web searches
• online documents, images, maps etc
• people you know who might have data
Creating your own data: People
Creating your own data: Sensors
Be cynical about your data
• Is the data relevant to your problem?
• Where did this data come from?
– Who collected it?
– Why? What for?
– Do they have biases that might show up in the data?
• Are there holes in the data (demographic, geographical, political etc)?
• Do you have supporting data? Is it *really* from a different source?
Data filetypes
Some Data Types
• Structured data:
– Tables (e.g. CSVs, Excel tables)
– Relational and nested data (e.g. JSON, XML, SQLite)
• Unstructured data:
– Free-text (e.g. Tweets, webpages etc)
• Maps and images:
– Vector data (e.g. shapefiles)
– Raster data (e.g. GeoTIFFs)
– Images
CSVs
• Comma-separated values
• Lots of commas
• Sometimes tab-separated (TSVs)
• Most applications read CSVs
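Reading a CSV needs nothing beyond Python's standard library. A minimal sketch, with made-up example data (in practice you'd pass an open file instead of a string):

```python
import csv
import io

# Hypothetical CSV data; in practice you would use open("myfile.csv") instead
csv_text = "country,year,rural_pct\nKenya,2015,74.2\nBrazil,2015,14.3\n"

# DictReader maps each data row to a dict keyed by the header line
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)
print(rows[0]["country"])  # → Kenya
```

Note that csv reads every value as a string; converting "74.2" to a number is up to you.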
JSON
• JavaScript Object Notation
• Lots of braces { }
• Structured, i.e. not always row-by-column
• Many APIs output JSON
• Not all applications read JSON
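Loading JSON is also a one-liner with the standard library. A sketch with an invented, API-style response (note the nesting: this isn't row-by-column data):

```python
import json

# Invented API-style response: nested lists and objects, not rows and columns
json_text = '{"indicator": "SP.RUR.TOTL.ZS", "data": [{"country": "Kenya", "value": 74.2}]}'

# json.loads turns the text into Python dicts and lists
record = json.loads(json_text)
print(record["data"][0]["value"])  # → 74.2
```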
XML
• eXtensible Markup Language
• Lots of brackets < >
• Structured, i.e. not always row-by-column
• Some applications read XML
• HTML is closely related to XML (XHTML is HTML expressed as XML)
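Python's standard library can parse XML too. A minimal sketch, using an invented snippet:

```python
import xml.etree.ElementTree as ET

# Invented XML snippet for illustration
xml_text = "<countries><country name='Kenya'><value>74.2</value></country></countries>"

root = ET.fromstring(xml_text)
# Walk the tree: attributes via .get(), element text via .text
names = [c.get("name") for c in root.findall("country")]
values = [float(c.find("value").text) for c in root.findall("country")]
print(names, values)  # → ['Kenya'] [74.2]
```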
Using an API
APIs
• “Application Programming Interface”
• A way for one computer application to ask
another one for a service
–Usually “give me this data”
–Sometimes “add this to your datasets”
RESTful APIs
http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=csv
• Base URL: api.worldbank.org
• What you’re asking for:
countries/all/indicators/SP.RUR.TOTL.ZS
• Details: date=2000:2015, format=csv
Using curl on the command line:
curl -X GET <URL>
Do this: try these URLs
• http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=csv
• http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=json
• http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=xml
the Python Requests library
import requests
import json

worldbank_url = "http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=json"
r = requests.get(worldbank_url)
jsondata = json.loads(r.text)
print(jsondata[1])
Request errors
Check r.status_code:
• 200: okay
• 400: bad request
• 401: unauthorised
• 404: page not found
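In code, check r.status_code before trusting a response. A minimal sketch (the helper name and fallback message are invented for illustration):

```python
# Status codes from the slide above; the helper name is hypothetical
STATUS_MESSAGES = {
    200: "okay",
    400: "bad request",
    401: "unauthorised",
    404: "page not found",
}

def describe_status(code):
    # Fall back to showing the raw code for anything not listed
    return STATUS_MESSAGES.get(code, "other ({})".format(code))

# e.g. after r = requests.get(url):
#     if r.status_code != 200:
#         print(describe_status(r.status_code))
```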
Requests with a password
import requests
r = requests.get('https://api.github.com/user',
                 auth=('yourgithubname', 'yourgithubpassword'))
dataset = r.text
• Note: GitHub now requires a personal access token in place of your account password for API authentication
PDF Scrapers
Scraping
• Data in files and webpages that’s easy for
humans to read, but difficult for machines
• Don’t scrape unless you have to
–Small dataset: type it in!
–Larger dataset: Look for datasets and APIs online
Development data is often in PDFs
Some PDFs can be Scraped
• Open the PDF file in Acrobat
• Can you cut-and-paste text in the file?
–Yes: use a PDF scraper
–No: the text is an image; you’ll need OCR first
PDF Table Scrapers
• Cut and paste to Excel
• Tabula: free, open source, offline
• Pdftables: not free, online
• CometDocs: free, online
Web Scrapers
Web Scraping
Design First!
What do you need to scrape?
● Which data values
● From which formats (html table, excel, pdf etc)
Do you need to maintain this?
● Is dataset regularly updated, or is once enough?
● How will you make updated data available to other people?
● Who could edit your code next year (if needed)?
Using Google Spreadsheets
• Open a Google spreadsheet
• Put this into cell A1:
=importHtml("http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population", "table", 2)
Web scraping in Python
● Webpage-grabbing libraries:
o requests
o mechanize
o cookielib (http.cookiejar in Python 3)
● Element-finding libraries:
o beautifulsoup
Unpicking HTML with Python
url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

import requests
from bs4 import BeautifulSoup

html = requests.get(url)
bsObj = BeautifulSoup(html.text, "html.parser")
tables = bsObj.find_all('table')
tables[0].find("th")
Getting data ready for science
Changing Data Formats
• Conversion websites
• Code:
import pandas as pd
df = pd.read_json("myfilename1.json")
df.to_csv("myfilename2.csv")
Normalising data
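A common meaning of normalising here is reshaping a wide table (one column per year) into a long one (one row per observation). A pure-Python sketch with invented figures:

```python
# Invented wide-format rows: one column per year
wide = [
    {"country": "Kenya", "2014": 74.8, "2015": 74.2},
    {"country": "Brazil", "2014": 14.5, "2015": 14.3},
]

# "Melt" into long (normalised) form: one row per country-year pair
long_rows = [
    {"country": row["country"], "year": year, "value": row[year]}
    for row in wide
    for year in ("2014", "2015")
]
print(len(long_rows))  # → 4
```

pandas does the same reshaping in one call with pd.melt.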
Books
• "Web Scraping with Python: Collecting Data from the
Modern Web", O'Reilly
Exercises
Prepare for next week
• Install Tableau
–See install instructions file
Prepare data
• Use your problem statement to look for datasets - what do
you need to answer your questions?
• If you can, convert your data into normalised CSV files
• Think about your data gaps - how can you fill them?
Editor's Notes
  • #2: Today we’re looking at the types of data that are hiding online, and how to bring them out of hiding and into your data science code.
  • #3: So let’s begin. Here are the 6 things we’ll talk about today.
  • #4: Your first problem is finding the data to help answer your questions.
  • #5: A quick recap: these are some of the places where you can find data. Some of them are harder to process than others, but they all contain data.
  • #6: And here are some places to find them - there’s a longer list in the references folder.
  • #7: Development data isn’t always easy to obtain: you might have to create your own, by asking people to contribute information to you through crowdsourcing, in-person surveys, mobile surveys etc.
  • #8: You might also need to generate data for your problem by using sensors.
  • #9: Selection bias = non-random selection of individuals. One example of this is pothole reporting: potholes are more generally reported in more-affluent areas, by people who have both the smartphone apps and the time and energy to report. Missing data = data that you don’t have. You need to be aware of this, and take account of it. If you need more persuading, read about Wald and the bullethole problem.
  • #10: There are many datafile types - here’s a guide to some of them.
  • #11: Tables typically have rows and columns; relational data is typically hierarchical, e.g. can’t be easily converted into row-column form.
  • #12: CSVs are the workhorse of datatypes: almost every data application can read them in.
  • #13: Converting JSON to CSV: use a conversion website (e.g. http://www.convertcsv.com/json-to-csv.htm), or write some Python code.
  • #14: Converting XML to CSV: use a conversion website (e.g. http://www.convertcsv.com/xml-to-csv.htm), or write code.
  • #15: One way to obtain data is through an application programming interface (API).
  • #16: More about open APIs: https://en.wikipedia.org/wiki/Open_API
  • #17: REST = Representational State Transfer: a human-readable way to ask APIs for information. At the top is a RESTful URL (web address); you can type this directly into an internet browser to get a datafile. This address has 3 parts: the base URL, api.worldbank.org; a description of what you’re looking for (in this case, the total rural population for all countries in the world); and some more details, including filters (only data between 2000 and 2015) and data formats. Try this address, and try “&format=json” instead of “&format=csv” at the end.
  • #20: The Python requests library is useful for calling APIs from a Python program (e.g. so you can then use or save the information returned from them). If anything goes wrong, try r.status_code. You’re maybe wondering how to get this JSON data into a file. Here’s the code for that:
import json
fout = open('mynewdata.json', 'w')
json.dump(jsondata, fout)
  • #21: See https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
  • #24: Here are places to look first: the website that data’s in, for file copies of the data; the website that data’s in, for an API (http://api.theirsitename.com/, http://theirsitename.com/api, Google “site:theirsitename.com api”); related sites, for file copies and APIs; community warehouses (scraperwiki.com, datahub.io etc.), for other peoples’ scrapers.
  • #25: Big PDFs. And we’ll need to get the data out of them. This is where PDF scrapers come in.
  • #29: Web scraping is the process of extracting data from webpages. If you open a webpage (e.g. https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population) and click on “view source”, you’ll see the view that a computer has of that page. This is where the data is hiding…
  • #31: The pattern for this is: =importHtml("your-weburl", "table", yourtablenumber). More: www.mulinblog.com/basic-web-scraping-data-visualization-using-google-spreadsheets/
  • #32: You’ve already used the Requests library to grab data from the web. Mechanize and cookielib help with forms, logins and cookies.
  • #34: Your exercises were all built into the class. But if you want more…
  • #35: Most data science and visualisation programs can read CSV data, so if you can easily convert data to that, good. There are websites that will convert to csv; you can also do this by reading data in one format, and writing it out in another. The Pandas library is very helpful for reading in one format, and writing in another, if the data is row-column.
  • #36: We’ll cover data cleaning later, but if you want to try next week’s visualisation techniques on your own data, it will need to at least be normalised. Here’s what we mean by this (and Tableau has a tool for doing this: see http://kb.tableau.com/articles/knowledgebase/denormalize-data).