No more struggles with Apache Spark (PySpark) workloads in production
Chetan Khatri, Data Science Practice Leader.
Accionlabs India.
PyConLT’19, May 26 - Vilnius, Lithuania
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri
Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
M.Sc. - Computer Science from University of Kachchh, India.
● Apache Spark
● Primary data structures (RDD, DataSet, Dataframe)
● Koalas: pandas API on Apache Spark
● Pragmatic explanation - executors, cores, containers, stages, jobs, and tasks in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Alternatives to Spark's default sort
● Why dropDuplicates() doesn't give consistent results, and what the alternative is
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Apache Spark is a fast, general-purpose cluster computing system - a unified engine for massive data
processing.
● It provides high-level APIs for Scala, Java, Python and R, and an optimized engine that supports general
execution graphs (a minimal PySpark entry point is sketched after the component list below).
Structured Data / SQL - Spark SQL
Graph Processing - GraphX
Machine Learning - MLlib
Streaming - Spark Streaming, Structured Streaming
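The code sketches in this deck assume a running SparkSession, the entry point to the DataFrame and SQL APIs. A minimal PySpark setup (the application name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) the session; the low-level SparkContext is spark.sparkContext
spark = (SparkSession.builder
         .appName("pyconlt19-examples")
         .getOrCreate())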
[Diagram] RDDs: the logical data model, distributed across cluster storage (HDFS, S3).
[Diagram] RDD -> T -> RDD -> T -> RDD, where T = Transformation.
RDDs are typed: Integer RDD, String or Text RDD, Double or Binary RDD.
[Diagram] RDD - T - RDD - T - RDD - T - RDD - A - RDD
T = Transformation, A = Action
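For example (a minimal PySpark sketch, assuming the SparkSession named spark above): transformations only record lineage, and nothing runs until an action is called.

# transformations (lazy): nothing executes yet, only the lineage is recorded
rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)                        # T
multiples_of_four = doubled.filter(lambda x: x % 4 == 0)  # T

# the action triggers execution of the whole chain
print(multiples_of_four.collect())                        # A -> [0, 4, 8, 12, 16]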
Operations split into two groups - TRANSFORMATIONS and ACTIONS - each spanning four categories: General, Math / Statistical, Set Theory / Relational, and Data Structure / I/O.

TRANSFORMATIONS
General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
Math / Statistical: sample, randomSplit
Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
General: reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate, treeReduce, foreachPartition, collectAsMap
Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
Set Theory / Relational: takeOrdered
Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile
You care about control of the dataset and know how the data looks; you want the low-level API.
You don't mind lots of lambda functions instead of a DSL.
You don't care about the schema or structure of the data.
You don't care about optimization, performance and inefficiencies!
Very slow for non-JVM languages like Python and R.
You don't care about inadvertent inefficiencies.
DataFrames & Datasets

                  SQL        DataFrames     Datasets
Syntax errors     Runtime    Compile time   Compile time
Analysis errors   Runtime    Runtime        Compile time

Analysis errors are caught before a job runs on the cluster.
# convert RDD -> DF with column names
parsedDF = parsedRDD.toDF("project", "sprint", "numStories")

# filter, groupBy, sum via agg() - in PySpark, use column expressions
# and .alias() rather than Python lambdas or Scala's === / .as
from pyspark.sql.functions import col, sum as sum_

(parsedDF.filter(col("project") == "finance")
    .groupBy("sprint")
    .agg(sum_("numStories").alias("count"))
    .limit(100)
    .show(100))
project sprint numStories
finance 3 20
finance 4 22
parsedDF.createOrReplaceTempView("audits")
results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
[Diagram] Catalyst query planning: SQL AST / DataFrame / Dataset
-> Unresolved Logical Plan -> (analysis) Logical Plan -> (optimization) Optimized Logical Plan
-> Physical Plans -> Cost Model -> Selected Physical Plan -> RDDs
employees.join(events, employees["id"] == events["eid"]) \
    .filter(events["date"] > "2015-01-01")
[Diagram] Logical Plan: employees table and events file -> join -> filter.
Physical Plan: scan (employees) joins with scan (events); the filter moves below the join, onto the events side.
Physical Plan with Predicate Pushdown and Column Pruning: optimized scan (employees) joins with optimized scan (events); the filter is pushed into the events scan itself.
Source: Databricks
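These plans are inspectable from PySpark; a minimal sketch using the same (illustrative) DataFrames:

# Prints the parsed, analyzed and optimized logical plans plus the physical plan
employees.join(events, employees["id"] == events["eid"]) \
    .filter(events["date"] > "2015-01-01") \
    .explain(True)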
● Pandas - analyze small datasets, on a single machine.
● Spark - analyze large datasets, distributed across a cluster.

                 Pandas DataFrame              Spark DataFrame
Column           df['col']                     df['col']
Mutability       Mutable                       Immutable
Add a column     df['Z'] = df['X'] + df['Y']   df.withColumn('Z', df['X'] + df['Y'])
Rename columns   df.columns = ['X', 'Y']       df.select(df['Q1'].alias('X'), df['P1'].alias('Y'))
Reference example: https://github.com/chetkhatri/PyConLT2019
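Koalas (reference [1]) closes this gap by implementing the pandas API on top of Spark. A sketch, assuming the databricks koalas package is installed (pip install koalas; in Spark 3.2+ the same API ships as pyspark.pandas):

import databricks.koalas as ks

# pandas-like, mutable-looking syntax, executed as Spark jobs under the hood
kdf = ks.DataFrame({'X': [1, 2, 3], 'Y': [10, 20, 30]})
kdf['Z'] = kdf['X'] + kdf['Y']
print(kdf.head())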
Executors
Cores
Containers
Stage
Job
Task
Job - each action in Spark triggers a separate job (transformations are lazy and only extend the plan).
Stage - a set of tasks within a job that can run in parallel; executors run them on a thread pool.
Task - the lowest-level unit of concurrent and parallel execution.
Each stage is split into #number-of-partitions tasks,
i.e. total tasks = number of stages × number of partitions per stage (see the sketch below).
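A small sketch of how these terms map onto code (a 4-partition input, so the counts are illustrative):

rdd = spark.sparkContext.parallelize(range(100), 4)   # 4 partitions
pairs = rdd.map(lambda x: (x % 10, x))                # narrow transformation: stays in the same stage
totals = pairs.reduceByKey(lambda a, b: a + b)        # shuffle: marks a stage boundary
totals.collect()   # the action triggers 1 job with 2 stages; the first stage runs 4 tasks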
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 6
yarn.scheduler.minimum-allocation-mb = 4096
yarn.scheduler.maximum-allocation-mb = 28832
yarn.nodemanager.resource.memory-mb = 54000
Max containers you can run = floor(yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb)
= floor(54000 / 4096) = 13
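A quick sanity check of the executor sizing used in the spark-submit example later in this deck, against these limits (a sketch; it assumes the default spark.executor.memoryOverhead of max(384 MB, 10% of executor memory)):

# --executor-memory 24g -> container request = 24g + max(384m, 0.10 * 24g) ≈ 26.4g
# 26.4g (~27034 MB) <= yarn.scheduler.maximum-allocation-mb (28832 MB)  -> fits
# --executor-cores 6  <= yarn.scheduler.maximum-allocation-vcores (6)   -> fits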
What happens when you run this code?
What would be the impact on the database engine side?
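Presumably (per the agenda's "Parallel read from JDBC" item) the code in question is a plain JDBC read. A sketch of the naive read versus a parallel one (URL, credentials, table name and bounds are illustrative):

jdbc_url = "jdbc:postgresql://dbhost:5432/hr"   # illustrative
props = {"user": "spark", "password": "***", "driver": "org.postgresql.Driver"}

# Naive: one connection, one partition - a single task does all the work,
# and the database serves one long-running full scan.
df = spark.read.jdbc(url=jdbc_url, table="EMPLOYEE", properties=props)

# Parallel: Spark issues numPartitions range queries over the partition column
# (e.g. WHERE id >= 1 AND id < 250001), spreading the load across executors.
df = spark.read.jdbc(
    url=jdbc_url, table="EMPLOYEE",
    column="id", lowerBound=1, upperBound=1000000, numPartitions=4,
    properties=props)

Note that lowerBound/upperBound only control the partition stride, not filtering; a skewed partition column still yields skewed tasks, and too many partitions can overwhelm the database with concurrent connections.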
The JoinSelection execution planning strategy uses the spark.sql.autoBroadcastJoinThreshold
property (default: 10 MB) to decide whether a dataset is small enough to be broadcast to all
worker nodes when performing a join.
# check the broadcast join threshold, in MB
>>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) / 1024 / 1024
10.0
// logical plan with the tree numbered (Scala API)
sampleDF.queryExecution.logical.numberedTreeString
// query plan (Scala; the PySpark equivalent is sampleDF.explain())
sampleDF.explain
Repartition: boost parallelism by increasing the number of partitions; partition on the join key to make
same-key joins faster.
// Reduce the number of partitions without shuffling - repartition, by contrast, shuffles data equally across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))
For example, in a bulk JDBC write with parameter "bulkCopyBatchSize" -> "2500": the DataFrame has 10 partitions
and each partition writes batches of 2,500 records, in parallel.
Reducing partitions lowers the cost of network communication, file I/O, network I/O, bandwidth, etc.
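A sketch of the coalesce/repartition difference (partition counts are illustrative):

df = spark.range(1000000).repartition(200)   # full shuffle into exactly 200 partitions
print(df.rdd.getNumPartitions())             # 200
narrowed = df.coalesce(10)                   # merges existing partitions, no full shuffle
print(narrowed.rdd.getNumPartitions())       # 10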
1. // disable auto-broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. // order doesn't matter to the broadcast planner
table1.join(table2, on, 'left') or table2.join(table1, on, 'left')
3. Force a broadcast with a hint only when you know one DataFrame really is small!
4. Minimize shuffling & boost parallelism: partitioning, bucketing, coalesce, repartition,
HashPartitioner (see the sketch below)
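A sketch of controlling the join strategy explicitly in PySpark (factDF/dimDF are illustrative names):

from pyspark.sql.functions import broadcast

# Force a BroadcastHashJoin: dimDF is known to be small
joined = factDF.join(broadcast(dimDF), "key")

# Or disable auto-broadcast and fall back to SortMergeJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
joined.explain()   # verify which join strategy was selected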
./bin/spark-submit \
  --name PyConLT19 \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 18g \
  --executor-memory 24g \
  --num-executors 4 \
  --executor-cores 6 \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.speculation=false \
  --conf spark.broadcast.compress=true \
  --conf spark.sql.broadcastTimeout=36000 \
  --conf spark.network.timeout=2500s \
  --conf spark.executor.heartbeatInterval=30s \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorAllocationRatio=1 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=6 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=15s \
  --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s \
  examples/src/main/python/pi.py
[1] Koalas: pandas API on Apache Spark
[URL] https://github.com/databricks/koalas
[2] Delta Lake: an open-source storage layer that brings scalable, ACID transactions
to Apache Spark™ and big data workloads. https://delta.io
[URL] https://github.com/delta-io/delta
[3] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks.
[URL] https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/