Parallelize data processing
@PyDataVenice #13 #Meetup #PyData
Alessandra Bilardi
Data & Automation Specialist
● AWS User Group Venezia member
● Coderdojo member
● PyData Venezia member
alessandra.bilardi@gmail.com
@abilardi
bilardi
Agenda
Goal
Challenge
Solution
Results
Best practices
Goal
Challenge
May open source be with you
Challenge
https://h2oai.github.io/db-benchmark/
Challenge
https://www.wrighters.io/options-to-run-pandas-dataframe-apply-in-parallel/
Solution
https://h2oai.github.io/db-benchmark/
Solution
Goal: performance
Challenge: on 4 CPUs & 8 GB RAM
● try.sh
○ for each dataset
■ run.py
● run.py
○ for each library
■ importlib
■ main()
● main()
○ for each method
■ timeit before/after
■ print timing
■ gc.collect()
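The loop structure above (import each library module by name, time each method, print the timing, collect garbage) can be sketched as follows. This is an illustration under stated assumptions, not the speaker's run.py: the function name `time_methods` and its parameters are hypothetical, and the standard-library `math` module stands in for the per-library modules so the sketch runs anywhere.

```python
import gc
import importlib
import timeit


def time_methods(module_name, method_names, argument):
    """Import a module by name and time one call of each listed function.

    Hypothetical helper: the real benchmark modules and their method
    names are not shown in the slides.
    """
    module = importlib.import_module(module_name)
    timings = {}
    for name in method_names:
        method = getattr(module, name)
        start = timeit.default_timer()      # clock before the call
        method(argument)
        elapsed = timeit.default_timer() - start  # clock after the call
        timings[name] = elapsed
        print(f"{module_name}.{name}: {elapsed:.6f} s")
        gc.collect()  # free memory so one method does not penalize the next
    return timings


if __name__ == "__main__":
    # Stand-in workload: two functions from the standard-library math module.
    time_methods("math", ["sqrt", "factorial"], 1000)
```

Calling `gc.collect()` between methods keeps leftovers from one measurement out of the next, which matters on a machine with limited RAM.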
Datasets - rows
Goal: execution timing
Challenge: resources
● 10
● 50
● 100
● 500
● 1,000
● 5,000
● 10,000
● 50,000
● 100,000
● 500,000
● 1,000,000
● 5,000,000
● 10,000,000
Libraries
Goal: learning curve
Challenge: the same behaviours
● concurrent
● dask
● joblib
○ joblib on dask
● modin
○ modin on dask
● multiprocessing
● multiprocesspandas
● pandarallel
● pandas
● parallelize
● pyspark
● swifter
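Several of the libraries listed above parallelize work by splitting the rows into chunks, processing each chunk in a worker, and combining the partial results. A minimal sketch of that pattern with the standard-library `concurrent.futures` package follows. The helper names (`split_rows`, `chunk_total`, `parallel_total`) and the sum-of-squares workload are assumptions made for illustration and do not come from the talk.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_rows(rows, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal chunks."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when rows < n_chunks
            chunks.append(rows[start:end])
        start = end
    return chunks


def chunk_total(chunk):
    """Hypothetical per-chunk work: sum of squares of the values."""
    return sum(value * value for value in chunk)


def parallel_total(rows, workers=4, executor_cls=ProcessPoolExecutor):
    """Process each chunk in a worker, then combine the partial results.

    executor_cls may be ProcessPoolExecutor (CPU-bound work) or
    ThreadPoolExecutor (I/O-bound work, or quick local checks).
    """
    chunks = split_rows(rows, workers)
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(chunk_total, chunks))
    return sum(partials)
```

With the default `ProcessPoolExecutor`, call `parallel_total` from inside an `if __name__ == "__main__":` guard, because worker processes may re-import the main module. The default of 4 workers matches the 4-CPU machine used in the talk.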
run_libraries.py
main()
Solution main()
Results
$ grep "hlp.get_pd_sample()" *
concurrent_actions.py: sample = hlp.get_pd_sample()
joblib_actions.py: sample = hlp.get_pd_sample()
joblibdask_actions.py: sample = hlp.get_pd_sample()
multiprocessing_actions.py: sample = hlp.get_pd_sample()
multiprocesspandas_actions.py: sample = hlp.get_pd_sample()
pandarallel_actions.py: sample = hlp.get_pd_sample()
pandas_actions.py:sample = hlp.get_pd_sample()
parallelize_actions.py: sample = hlp.get_pd_sample()
swifter_actions.py: sample = hlp.get_pd_sample()
Results
Plots: dask, modin on dask
Solution main()
Results
Random code segments or is it something more ..
Results
63.488
84.034
May data power be with you
Oddities
Goal: performance
Challenge: on 4 CPUs & 8 GB RAM
● pandas takes longer than the libraries that use it
● pandas generates more PIDs & threads than pyspark
● modin generates more PIDs & threads than the others
● modin works better if ..
● modin works better if ..
Best practices
Goal: performance
Challenge: on 4 CPUs & 8 GB RAM
● vectorization
○ better than apply
● multiprocessing
● pay attention to
○ df.copy()
○ groupby / sum / apply
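To illustrate the first point, here is a small sketch that compares a row-wise `apply` with the equivalent vectorized expression in pandas. The column names (`price`, `quantity`), the data, and the frame size are hypothetical and are not the datasets used in the talk. Timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data: 10,000 rows with two numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.random(10_000),
    "quantity": rng.integers(1, 10, size=10_000),
})


def with_apply(frame):
    """Row-wise apply: calls a Python function once per row."""
    return frame.apply(lambda row: row["price"] * row["quantity"], axis=1)


def vectorized(frame):
    """Vectorized: one operation over whole columns."""
    return frame["price"] * frame["quantity"]


if __name__ == "__main__":
    t_apply = timeit.timeit(lambda: with_apply(df), number=1)
    t_vector = timeit.timeit(lambda: vectorized(df), number=1)
    print(f"apply:      {t_apply:.4f} s")
    print(f"vectorized: {t_vector:.4f} s")
```

Both functions return the same values. The vectorized form avoids a Python-level call per row.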
Thanks for listening.

Parallelize data processing - 2023-10-24