Chetan Khatri, Lead - Data Science. Accionlabs India.
Paris Scala User Group (PSUG), Paris - France.
23rd May, 2019.
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri
Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
M.Sc. - Computer Science from University of Kachchh, India.
● Apache Spark
● Primary data structures (RDD, DataSet, Dataframe)
● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Alternatives to Spark's default sort
● Why dropDuplicates() doesn't guarantee consistent results, and what the alternative is
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Why not to use Scala's concurrent 'Future' explicitly!
● Apache Spark is a fast, general-purpose cluster computing system: a unified engine for massive data processing.
● It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs.
Structured Data / SQL - Spark SQL
Graph Processing - GraphX
Machine Learning - MLlib
Streaming - Spark Streaming, Structured Streaming
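As a quick orientation, a minimal sketch (not from the slides; the app name and master setting are assumptions): all of these libraries are driven through a single SparkSession entry point.

import org.apache.spark.sql.SparkSession

// One unified entry point drives SQL, DataFrames, streaming, and MLlib pipelines.
val spark = SparkSession.builder
  .appName("psug-demo")
  .master("local[*]") // assumption: local run; on a cluster this comes from spark-submit
  .getOrCreate()

val df = spark.range(10).toDF("n") // a tiny DataFrame from the unified API
df.show()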
No more struggles with Apache Spark workloads in production
RDD: the logical model across distributed storage on a cluster (HDFS, S3).
RDD -> T -> RDD -> T -> RDD
T = Transformation
Integer RDD
String or Text RDD
Double or Binary RDD
RDD -> T -> RDD -> T -> RDD -> T -> RDD -> A -> result
T = Transformation
A = Action
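A minimal sketch of this lineage (assumed example, reusing the spark session from the sketch above): transformations only build the RDD graph; nothing executes until the action runs.

val rdd = spark.sparkContext.parallelize(1 to 10) // source RDD
val t1 = rdd.map(_ * 2)                           // T: transformation (lazy)
val t2 = t1.filter(_ > 5)                         // T: transformation (lazy)
val n  = t2.count()                               // A: action - triggers the actual job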
Operations: Transformations and Actions

TRANSFORMATIONS
General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
Math / Statistical: sample, randomSplit
Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
General: reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate, treeReduce, foreachPartition, collectAsMap
Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
Set Theory / Relational: takeOrdered
Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile
Use RDDs when:
● You care about fine-grained control of your dataset and know what the data looks like; you want the low-level API.
● You don't mind lots of lambda functions instead of a DSL.
● You don't care about a schema or structure for your data.
● You don't care about optimization, performance & inefficiencies!
● You don't care about inadvertent inefficiencies.
Caveat: RDDs are very slow for non-JVM languages like Python and R.
parsedRDD.filter { case (project, sprint, numStories) => project == "finance" }.
  map { case (_, sprint, numStories) => (sprint, numStories) }.
  reduceByKey(_ + _).
  filter { case (sprint, _) => !isSpecialSprint(sprint) }.
  take(100).foreach { case (sprint, stories) => println(s"sprint: $sprint, stories: $stories") }
val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects.
case class Employee(name: String, age: Int)
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
val filterDS = employeesDS.filter(p => p.age > 3)
Type-safe: operate on domain objects with compiled lambda functions.
DataFrames & Datasets
Datasets give you:
● Strong typing
● The ability to use powerful lambda functions
● Spark SQL's optimized execution engine (Catalyst, Tungsten)
Datasets can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap, etc.).
A DataFrame is a Dataset organized into named columns; DataFrame is simply a type alias of Dataset[Row].
                 SQL      DataFrames     Datasets
Syntax Errors    Runtime  Compile Time   Compile Time
Analysis Errors  Runtime  Runtime        Compile Time

Analysis errors are caught before a job runs on the cluster.
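A minimal sketch of the difference (assumed example; employees.json and the misspelled column are hypothetical):

case class Employee(name: String, age: Int)

val employeesDF = spark.read.json("employees.json") // DataFrame = Dataset[Row]
import spark.implicits._                            // encoders for case classes
val employeesDS = employeesDF.as[Employee]          // typed Dataset

// DataFrame: a misspelled column name compiles fine and fails only at
// runtime with an AnalysisException.
employeesDF.select("agee")

// Dataset: the same typo on a domain object is caught at compile time.
// employeesDS.map(e => e.agee) // does not compile: value agee is not a member of Employee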
Since the Dataset API unification (2016, Spark 2.0):
Untyped API: DataFrame = Dataset[Row] (an alias)
Typed API: Dataset[T]
// convert RDD -> DF with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
//filter, groupBy, sum, and then agg()
parsedDF.filter($"project" === "finance").
groupBy($"sprint").
agg(sum($"numStories").as("count")).
limit(100).
show(100)
project sprint numStories
finance 3 20
finance 4 22
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
.map { case (dept, (age, c)) => dept -> age / c }
Catalyst query planning:
SQL AST / DataFrame / Dataset
  -> Unresolved Logical Plan
  -> Logical Plan
  -> Optimized Logical Plan
  -> Physical Plans
  -> (Cost Model selects)
  -> Selected Physical Plan
  -> RDDs
employees.join(events, employees("id") === events("eid"))
.filter(events("date") > "2015-01-01")
Logical Plan: [employees table] and [events file] -> join -> filter.
Physical Plan: scan (employees), scan (events) -> join -> filter.
Physical Plan with Predicate Pushdown and Column Pruning: optimized scan (employees), optimized scan (events) -> join (the filter is pushed into the events scan).
Source: Databricks
Executors
Cores
Containers
Stage
Job
Task
Job - each action in Spark creates a separate job (transformations are lazy; they only extend the lineage).
Stage - a set of tasks within a job that can run in parallel, using a ThreadPoolExecutor.
Task - the lowest-level unit of concurrent and parallel execution.
Each stage is split into #number-of-partitions tasks,
i.e. number of tasks = number of stages × number of partitions per stage.
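To make the counting concrete, a small sketch (assumed numbers, reusing the spark session from earlier):

// An RDD with 8 partitions.
val data = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

// reduceByKey introduces a shuffle boundary, so collect() launches ONE job
// with TWO stages; each stage runs 8 tasks (one per partition) => 16 tasks.
data.map(n => (n % 4, n))
    .reduceByKey(_ + _)
    .collect()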
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 6
yarn.scheduler.minimum-allocation-mb = 4096
yarn.scheduler.maximum-allocation-mb = 28832
yarn.nodemanager.resource.memory-mb = 54000

Maximum number of containers you can run = floor(yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb) = floor(54000 / 4096) = 13
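Under these limits, one way to size executors so each container fits the YARN allocations; a hedged sketch (the numbers are assumptions derived from the properties above, not prescribed by the slides):

import org.apache.spark.sql.SparkSession

// Each executor container must fit within yarn.scheduler.maximum-allocation-mb
// (28832 MB) and maximum-allocation-vcores (6): heap + overhead <= 28832 MB.
val spark = SparkSession.builder
  .appName("executor-sizing-demo")
  .config("spark.executor.cores", "5")           // <= 6 vcores per container
  .config("spark.executor.memory", "24g")        // heap, leaving room for overhead
  .config("spark.executor.memoryOverhead", "4g") // 24g + 4g <= 28832 MB
  .config("spark.executor.instances", "12")      // <= 13 containers, keeping one for the AM
  .getOrCreate()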
What happens when you run this code?
What would the impact be on the database engine side?
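The code on this slide is an image and isn't in the transcript; as a hedged illustration of the "parallel read from JDBC" agenda point (URL, table, and bounds are hypothetical), compare a naive read against a partitioned one:

// Naive: one executor task runs a single "SELECT * FROM employee" - the
// database serves one huge scan over one connection.
val naiveDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/hr")
  .option("dbtable", "employee")
  .load()

// Parallel: Spark generates 8 range-predicate queries on the numeric
// partition column, so 8 concurrent scans hit the database engine.
val parallelDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/hr")
  .option("dbtable", "employee")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()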
The JoinSelection execution planning strategy uses the spark.sql.autoBroadcastJoinThreshold property (default: 10 MB) as the maximum size of a dataset that will be broadcast to all worker nodes when performing a join.
val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toInt
scala> threshold / 1024 / 1024
res0: Int = 10
// logical plan with tree numbered
sampleDF.queryExecution.logical.numberedTreeString
// Query plan
sampleDF.explain
Repartition: boost parallelism by increasing the number of partitions; partition on the join key so that same-key joins run faster.

// Reduce the number of partitions without shuffling (repartition, by contrast, shuffles data equally across the cluster).
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))

For example, in the case of a bulk JDBC write with parameter "bulkCopyBatchSize" -> "2500": the DataFrame has 10 partitions and each partition writes 2,500 records per batch, in parallel.

Reducing partitions lowers the impact on network communication, file I/O, network I/O, bandwidth, etc.
1. // Disable auto-broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. // Order doesn't matter for performance
table1.leftjoin(table2) or table2.leftjoin(table1)
3. Force a broadcast when one DataFrame is small, even if the threshold check misses it; never broadcast one that is not small! (See the sketch below.)
4. Minimize shuffling & boost parallelism: partitioning, bucketing, coalesce, repartition, HashPartitioner.
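A minimal sketch of point 3 (assumed data; the broadcast() hint is the standard Spark SQL API for this):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("broadcast-demo").master("local[*]").getOrCreate()
import spark.implicits._

val largeDF = (1 to 1000000).toDF("id")
val smallDF = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Force a BroadcastHashJoin even when autoBroadcastJoinThreshold is -1.
val joined = largeDF.join(broadcast(smallDF), Seq("id"), "left")
joined.explain() // physical plan should show BroadcastHashJoin, not SortMergeJoin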
Code!