The problem
Data ingestion
Data parsing
Data cleansing
Scaling out
Building a data processing pipeline in Python
Joe Cabrera
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com
PyGotham, 2015
Joe Cabrera Building a data processing pipeline in Python
Outline
1 The problem
2 Data ingestion
3 Data parsing
4 Data cleansing
5 Scaling out
Poorly formatted data
Largely dispersed across the web
No standard data processing library
Pandas
Bubbles
Data processing
Requests and Futures
Requests makes it easy to send the required parameters
concurrent.futures allows download requests to execute asynchronously
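
A minimal sketch of how the two libraries might be combined, assuming a thread pool is acceptable for I/O-bound downloads. The function names (`fetch`, `download_all`), the `(url, params)` job shape, the timeout and the worker count are illustrative assumptions, not taken from the talk.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url, params=None, timeout=10):
    """Download one URL with Requests and return the response body as text."""
    # Imported here so the concurrency logic below can be exercised
    # with a stand-in fetcher when Requests is not installed.
    import requests

    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.text


def download_all(jobs, fetcher=fetch, max_workers=8):
    """Run every (url, params) job on a thread pool.

    Returns two dicts keyed by URL: successful bodies and raised exceptions.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be attributed.
        futures = {pool.submit(fetcher, url, params): url for url, params in jobs}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; one bad URL should not stop the batch
                errors[url] = exc
    return results, errors
```

Collecting errors separately from results lets a failed download be retried later without discarding the pages that did arrive.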
Parsers
Python tokenize
BeautifulSoup
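
As a rough illustration of the first option, the standard-library `tokenize` module can split text that follows Python's lexical rules into typed tokens. The helper name `extract_fields` and the sample input are assumptions made for this sketch; input that is not Python-like would need a different parser.

```python
import io
import tokenize


def extract_fields(source):
    """Return (token type name, text) pairs for names, numbers and strings."""
    wanted = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING}
    # generate_tokens expects a readline callable that yields str lines.
    reader = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(reader)
        if tok.type in wanted
    ]


fields = extract_fields("price = 42 * qty")
# fields == [("NAME", "price"), ("NUMBER", "42"), ("NAME", "qty")]
```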
Why BeautifulSoup
More forgiving than standard XML or HTML libraries
Supports regular expressions as search filters
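
A small sketch of both points, assuming the `bs4` package with Python's built-in `html.parser` backend. The markup, the class names and the `extract_prices` helper are invented for illustration. The sample has unquoted attributes, a stray closing tag and a missing `</ul>`, yet it still parses, and a compiled regular expression is used as the class filter.

```python
import re

from bs4 import BeautifulSoup

# Deliberately sloppy markup: unquoted attributes, a stray </div>, no </ul>.
BROKEN_HTML = """
<ul>
  <li class=price-usd>$10.50</li>
  <li class=price-eur>9.80</li>
  <li class=name>Widget</li>
</div>
"""


def extract_prices(html):
    """Map each price-* class name to the text of its list item."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled pattern is accepted wherever a string filter would be.
    cells = soup.find_all("li", class_=re.compile(r"^price-"))
    return {cell["class"][0]: cell.get_text(strip=True) for cell in cells}


prices = extract_prices(BROKEN_HTML)
# prices == {"price-usd": "$10.50", "price-eur": "9.80"}
```

How badly nested markup is repaired depends on the parser backend passed to `BeautifulSoup`, so the backend is worth pinning explicitly rather than leaving to the default.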
Celery job scheduling
Each download job is a task
Each parse job is a task
Each cleanse job is a task
Re-insert cleansed data
Clean up data after raw ingest
Separate stores for raw and clean data
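
The talk does not say which stores were used, so the following sketch substitutes two in-memory SQLite databases to show the shape of the idea: raw payloads are written once and never modified, and cleansed rows go into a second store keyed by the raw row's id. The table names, columns and the `cleanse` rule are hypothetical.

```python
import sqlite3


def open_stores():
    """Create separate raw and clean stores (in-memory SQLite for this sketch)."""
    raw = sqlite3.connect(":memory:")
    clean = sqlite3.connect(":memory:")
    raw.execute("CREATE TABLE raw_docs (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
    clean.execute("CREATE TABLE clean_docs (raw_id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    return raw, clean


def ingest(raw, payload):
    """Store the payload exactly as received and return its row id."""
    cursor = raw.execute("INSERT INTO raw_docs (payload) VALUES (?)", (payload,))
    raw.commit()
    return cursor.lastrowid


def cleanse(payload):
    """Hypothetical cleansing rule: collapse whitespace and title-case."""
    return " ".join(payload.split()).title()


def cleanse_all(raw, clean):
    """Read every raw row, cleanse it, and upsert it into the clean store."""
    rows = raw.execute("SELECT id, payload FROM raw_docs").fetchall()
    clean.executemany(
        "INSERT OR REPLACE INTO clean_docs (raw_id, title) VALUES (?, ?)",
        [(row_id, cleanse(payload)) for row_id, payload in rows],
    )
    clean.commit()
    return len(rows)
```

Because the raw store is left untouched, the cleansing rules can be changed and re-run over the original data at any time.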
Distributed task queue
Distribute data processing jobs to many machines
Distribute jobs on a given machine across many CPUs
SQLAlchemy basic sharding API
Each database has a shard id
We query for data based on which shard contains the data
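
The routing idea can be sketched without SQLAlchemy at all. The code below is not SQLAlchemy's sharding API; it uses plain `sqlite3` connections to show a shard id being chosen from a key and the same choice being reused at query time. The shard ids, the `docs` table and the CRC-based chooser are assumptions. SQLAlchemy's horizontal sharding extension (`sqlalchemy.ext.horizontal_shard`) handles the same decisions through chooser callbacks on a sharded session.

```python
import sqlite3
import zlib

SHARD_IDS = ("shard_0", "shard_1", "shard_2")  # hypothetical shard ids


def make_shards():
    """Open one database per shard id (in-memory SQLite for this sketch)."""
    shards = {}
    for shard_id in SHARD_IDS:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE docs (url TEXT PRIMARY KEY, body TEXT)")
        shards[shard_id] = conn
    return shards


def shard_for(url):
    """Pick a shard id from the key; crc32 is stable across processes, unlike hash()."""
    return SHARD_IDS[zlib.crc32(url.encode("utf-8")) % len(SHARD_IDS)]


def save(shards, url, body):
    """Write the row to the shard that owns this URL."""
    conn = shards[shard_for(url)]
    conn.execute("INSERT OR REPLACE INTO docs (url, body) VALUES (?, ?)", (url, body))
    conn.commit()


def load(shards, url):
    """Query only the shard that can contain this URL."""
    row = shards[shard_for(url)].execute(
        "SELECT body FROM docs WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None
```

Modulo-based routing like this reshuffles keys whenever the number of shards changes, so a production setup would usually keep an explicit key-to-shard mapping or use consistent hashing.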
Questions
Thanks!
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com