Parallelize data processing - 2023-10-24

Parallelize data processing
@PyDataVenice #13 #Meetup #PyData

Alessandra Bilardi
Data & Automation Specialist
● AWS User Group Venezia member
● Coderdojo member
● PyData Venezia member
alessandra.bilardi@gmail.com
@abilardi
bilardi

Agenda
Goal
Challenge
Solution
Results
Best practices

Challenge
https://h2oai.github.io/db-benchmark/

Challenge
https://www.wrighters.io/options-to-run-pandas-dataframe-apply-in-parallel/

Solution
https://h2oai.github.io/db-benchmark/

Solution
Goal:
performance
Challenge:
on 4 CPU & 8GB RAM
● try.sh
○ for each dataset
■ run.py
● run.py
○ for each library
■ importlib
■ main()
● main()
○ for each method
■ timeit before/after
■ print timing
■ gc.collect()
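
The loop described on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the talk: the names `run` and `time_methods` are hypothetical, and the standard-library `statistics` module stands in for the per-library modules so the sketch runs on its own.

```python
import gc
import importlib
import timeit


def time_methods(module, methods, sample):
    """Time each named method of `module` once; return {method: seconds}."""
    timings = {}
    for name in methods:
        func = getattr(module, name)
        start = timeit.default_timer()           # timeit before
        func(sample)
        elapsed = timeit.default_timer() - start  # timeit after
        timings[name] = elapsed
        print(f"{module.__name__}.{name}: {elapsed:.6f}s")  # print timing
        gc.collect()  # release memory before the next measurement
    return timings


def run(libraries, methods, sample):
    """Import each library by name with importlib and time its methods."""
    results = {}
    for lib in libraries:
        module = importlib.import_module(lib)
        results[lib] = time_methods(module, methods, sample)
    return results


# Hypothetical stand-in: `statistics` plays the role of one library module.
results = run(["statistics"], ["mean", "median"], list(range(1000)))
```

An outer shell script such as try.sh would then call this once per dataset size, so that each size starts in a fresh Python process.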

Datasets - rows
Goal:
execution timing
Challenge:
resources
● 10
● 50
● 100
● 500
● 1,000
● 5,000
● 10,000
● 50,000
● 100,000
● 500,000
● 1,000,000
● 5,000,000
● 10,000,000

Libraries
Goal:
learning curve
Challenge:
the same behaviours
● concurrent
● dask
● joblib
○ joblib on dask
● modin
○ modin on dask
● multiprocessing
● multiprocesspandas
● pandarallel
● pandas
● parallelize
● pyspark
● swifter
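
To give an idea of what "the same behaviours" means in practice, here is a minimal sketch of one pattern most of these libraries share: split the data into chunks, process the chunks in parallel, combine the results. It uses only the standard-library `concurrent.futures` module. The functions `split`, `square_sum` and `parallel_square_sum` are hypothetical names chosen for this illustration and do not come from the talk's code.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square_sum(chunk):
    """CPU-bound stand-in for a per-chunk computation."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Split `data` into at most `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when parts > len(data)
            chunks.append(data[start:end])
        start = end
    return chunks


def parallel_square_sum(data, workers=4, executor_cls=ProcessPoolExecutor):
    """Map `square_sum` over the chunks in a pool and add up the partial sums.

    Pass executor_cls=ThreadPoolExecutor to swap processes for threads.
    """
    chunks = split(data, workers)
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(square_sum, chunks))
```

With the default process pool, call `parallel_square_sum(list(range(1_000_000)))` from a script guarded by `if __name__ == "__main__":`, which is required on platforms that start worker processes with spawn.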

Results
$ grep "hlp.get_pd_sample()" *
concurrent_actions.py: sample = hlp.get_pd_sample()
joblib_actions.py: sample = hlp.get_pd_sample()
joblibdask_actions.py: sample = hlp.get_pd_sample()
multiprocessing_actions.py: sample = hlp.get_pd_sample()
multiprocesspandas_actions.py: sample = hlp.get_pd_sample()
pandarallel_actions.py: sample = hlp.get_pd_sample()
pandas_actions.py:sample = hlp.get_pd_sample()
parallelize_actions.py: sample = hlp.get_pd_sample()
swifter_actions.py: sample = hlp.get_pd_sample()

Random code snippets, or is it something more ..

Oddities
Goal:
performance
Challenge:
on 4 CPU & 8GB RAM
● pandas takes longer
○ than the libs that use it
● pandas generates more
○ pids & threads than
pyspark
● modin generates more
○ pids & threads than
others
● modin works better if ..

Best practices
Goal:
performance
Challenge:
on 4 CPU & 8GB RAM
● vectorization
○ better than apply
● multiprocessing
● pay attention
○ df.copy()
○ groupby / sum / apply
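
As a small illustration of the first point, the sketch below computes the same column sum with a row-wise `apply` and with a vectorized expression. The DataFrame, its column names `a` and `b`, and the row count are assumptions made up for this example; they are not the datasets used in the talk, and the timings depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data: two integer columns.
df = pd.DataFrame({"a": np.arange(10_000), "b": np.arange(10_000) * 2})

# Row-wise apply: a Python function is called once per row.
start = timeit.default_timer()
applied = df.apply(lambda row: row["a"] + row["b"], axis=1)
apply_seconds = timeit.default_timer() - start

# Vectorized: a single operation over whole columns.
start = timeit.default_timer()
vectorized = df["a"] + df["b"]
vectorized_seconds = timeit.default_timer() - start

print(f"apply:      {apply_seconds:.6f}s")
print(f"vectorized: {vectorized_seconds:.6f}s")
```

Both variants give the same values. The vectorized form avoids the per-row Python call, which is usually where the time goes.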

Thanks for listening.
