Building a data processing pipeline in Python

The problem
Data ingestion
Data parsing
Data cleansing
Scaling out
Joe Cabrera
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com
PyGotham, 2015

Outline
1 The problem
2 Data ingestion
3 Data parsing
4 Data cleansing
5 Scaling out

Poorly formatted data

Largely dispersed across the web

No standard data processing library
Pandas
Bubbles

Data processing

Requests and Futures
Requests makes it easy to send the required parameters
concurrent.futures allows download requests to execute asynchronously
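A minimal sketch of this pattern, assuming a thread pool from concurrent.futures. The helper names fetch and download_all, the worker count, and the timeout are illustrative assumptions, not taken from the talk.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url, params=None, timeout=10):
    """Download one URL with Requests and return the response body as text."""
    import requests  # third-party; imported here so the sketch loads without it

    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


def download_all(urls, fetch=fetch, max_workers=8):
    """Run fetch(url) concurrently; return ({url: body}, {url: exception})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every download up front, then collect results as they finish.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[url] = exc
    return results, errors
```

Query parameters can be bound ahead of time, for example with functools.partial(fetch, params={...}), and passed in as the fetch argument.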

Parsers
Python tokenize
BeautifulSoup

Why BeautifulSoup
More forgiving than standard XML or HTML libraries
Supports regex
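Both points can be illustrated with a short sketch. The find_links helper and the MESSY_HTML fragment are hypothetical; the sketch assumes the beautifulsoup4 package is installed and uses Python's bundled html.parser backend.

```python
import re


def find_links(html, pattern):
    """Return href values of <a> tags whose href matches the regex pattern."""
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

    # "html.parser" ships with Python and tolerates missing closing tags.
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex can be used as an attribute filter in find_all.
    return [a["href"] for a in soup.find_all("a", href=re.compile(pattern))]


# Hypothetical page fragment with unclosed <ul> and <li> tags.
MESSY_HTML = """
<ul>
  <li><a href="/data/2015.csv">2015</a>
  <li><a href="/about.html">About</a>
  <li><a href="/data/2014.csv">2014</a>
"""
```

Calling find_links(MESSY_HTML, r"\.csv$") selects only the links that end in .csv, even though the list markup is never closed.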

Celery job scheduling
Each download job is a task
Each parse job is a task
Each cleanse job is a task
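One way this could look, as a sketch under assumptions: the stage functions, the comma-separated input format, the task names, and the make_pipeline helper are all hypothetical. The stages are kept as plain functions so they can be exercised without a broker, and are registered as Celery tasks only when make_pipeline is called.

```python
def download(url):
    """Fetch the raw text for one source."""
    import requests  # third-party

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse(raw_text):
    """Split raw comma-separated lines into lists of fields (hypothetical format)."""
    return [line.split(",") for line in raw_text.splitlines() if line.strip()]


def cleanse(rows):
    """Strip whitespace from every field and drop rows that end up empty."""
    cleaned = [[field.strip() for field in row] for row in rows]
    return [row for row in cleaned if any(row)]


def make_pipeline(broker_url):
    """Register each stage as a Celery task and return (app, submit)."""
    from celery import Celery, chain  # third-party

    app = Celery("pipeline", broker=broker_url)
    # Explicit names keep task lookup stable between clients and workers.
    download_task = app.task(name="pipeline.download")(download)
    parse_task = app.task(name="pipeline.parse")(parse)
    cleanse_task = app.task(name="pipeline.cleanse")(cleanse)

    def submit(url):
        # Each task's return value becomes the first argument of the next task.
        return chain(
            download_task.s(url), parse_task.s(), cleanse_task.s()
        ).apply_async()

    return app, submit
```

In a deployment, the module that workers load would call make_pipeline once at import time so that workers and clients register the same task names.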

Re-insert cleansed data
Clean up data after raw ingest
Separate stores for raw and clean data
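A minimal sketch of the raw/clean split, using two SQLite databases from the standard library as stand-ins for whatever stores are actually used. The table names, helper functions, and the default normalize step are assumptions for illustration.

```python
import sqlite3


def open_stores(raw_path=":memory:", clean_path=":memory:"):
    """Open two separate databases: one for raw ingests, one for cleansed rows."""
    raw = sqlite3.connect(raw_path)
    clean = sqlite3.connect(clean_path)
    raw.execute("CREATE TABLE IF NOT EXISTS raw_docs (id INTEGER PRIMARY KEY, body TEXT)")
    clean.execute(
        "CREATE TABLE IF NOT EXISTS records (raw_id INTEGER PRIMARY KEY, name TEXT)"
    )
    return raw, clean


def ingest(raw, body):
    """Store a downloaded document untouched and return its id."""
    with raw:  # commits on success
        return raw.execute("INSERT INTO raw_docs (body) VALUES (?)", (body,)).lastrowid


def cleanse_into(raw, clean, normalize=str.strip):
    """Read every raw document, normalize it, and write it to the clean store."""
    rows = raw.execute("SELECT id, body FROM raw_docs").fetchall()
    with clean:
        for raw_id, body in rows:
            name = normalize(body)
            if name:  # skip documents that are empty after cleansing
                # OR REPLACE makes re-running the cleanse step idempotent.
                clean.execute(
                    "INSERT OR REPLACE INTO records VALUES (?, ?)", (raw_id, name)
                )
    return clean.execute("SELECT raw_id, name FROM records ORDER BY raw_id").fetchall()
```

Because the raw store is never modified by cleansing, the cleanse step can be rerun with improved rules without downloading anything again.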

Distributed task queue
Distribute data processing jobs to many machines
Distribute jobs on a given machine across many CPUs

SQLAlchemy basic sharding API
Each database has a shard id
We query for data based on which shard contains the data
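The routing idea can be sketched without SQLAlchemy. The example below does not use SQLAlchemy's sharding API; it is a standard-library illustration in which the ShardedStore class, the hash-based shard choice, and the in-memory SQLite databases are all assumptions.

```python
import sqlite3
import zlib


class ShardedStore:
    """Route each key to one of several databases, identified by shard id."""

    def __init__(self, shard_ids):
        # One in-memory SQLite database per shard id stands in for a separate server.
        self.shards = {sid: sqlite3.connect(":memory:") for sid in shard_ids}
        self.shard_ids = sorted(self.shards)
        for conn in self.shards.values():
            conn.execute("CREATE TABLE records (key TEXT PRIMARY KEY, value TEXT)")

    def shard_for(self, key):
        """Pick a shard id from a stable hash of the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.shard_ids)
        return self.shard_ids[index]

    def put(self, key, value):
        conn = self.shards[self.shard_for(key)]
        with conn:
            conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)", (key, value))

    def get(self, key):
        # Only the shard that owns the key is queried.
        conn = self.shards[self.shard_for(key)]
        row = conn.execute(
            "SELECT value FROM records WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
```

A stable hash such as CRC32 is used instead of Python's built-in hash() so that the same key maps to the same shard across processes.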

Questions
Thanks!
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com
