Custom applications with Spark’s RDD
Tejas	Patil
Facebook
Agenda
• Use	case
• Real	world	applications
• Previous	solution
• Spark	version
• Data	skew
• Performance	evaluation
N-gram	language	model	training
Can	you	please	come	here ?
History
5-gram
Word	being	predicted
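A minimal sketch of the idea on this slide, in plain Scala: an n-gram is a run of n consecutive words, read as a history followed by the word being predicted. The object and method names (`NgramSketch`, `ngrams`, `historyAndTarget`) are hypothetical and whitespace tokenization is an assumption; the talk does not show how its training binary tokenizes text.

```scala
object NgramSketch {
  /** Splits a sentence on whitespace and returns every run of `n` consecutive words. */
  def ngrams(sentence: String, n: Int): Seq[Seq[String]] = {
    val words = sentence.trim.split("\\s+").filter(_.nonEmpty).toSeq
    // `sliding` would return one short window for short inputs, so guard against that.
    if (n < 1 || words.length < n) Seq.empty else words.sliding(n).toSeq
  }

  /** Views a non-empty n-gram as (history, word being predicted). */
  def historyAndTarget(ngram: Seq[String]): (Seq[String], String) =
    (ngram.init, ngram.last)
}
```

For the 5-gram "Can you please come here", `historyAndTarget` gives the history "Can you please come" and the predicted word "here".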
Real	world	applications
Auto-subtitling	for	Page	videos
Detecting	low	quality	places
• Non-public	places
• My	home
• Home	sweet	home
• Non-real	places
• Apt	#00,	Fake	lane,	Foo	City,	CA
• Mordor,	Westeros	!!
• Not suitable for viewing
• Anything	containing	nudity,	intense	sexuality,	profanity	or	disturbing	content
Previous	solution
[Diagram: sub-model training jobs 1, 2, …, `n` produce the intermediate sub models LM 1, LM 2, …, LM `n`, which the interpolation algorithm combines into the language model. Diagram labels: Hive query, Hive table.]
Sub-model 1 training job (Hive query):
INSERT OVERWRITE TABLE sub_model_1
SELECT ....
FROM (
  REDUCE m.ngram, m.group_key, m.count
  USING "./train_model --config=myconfig.json ...."
  AS `ngram`, `count`, ...
  FROM (
    SELECT ...
    FROM data_source
    WHERE ...
    DISTRIBUTE BY group_key
  )
)
GROUP BY `ngram`
Lessons	learned
• SQL is not a good choice for building such applications
• Duplication
• Poor	readability
• Brittle,	no	testing
• Alternatives
• Map-reduce
• Query	templating
• Latency	while	training	with	large	data
Spark	solution
• Same	high	level	architecture
• Hive	tables	as	final	inputs	and	outputs
• Same	binaries	used	in	Hive	TRANSFORM
• RDD	not	Datasets
• `pipe()`	operator
• Modular,	readable,	maintainable
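A minimal sketch of the `pipe()` operator mentioned above, assuming Spark is on the classpath and a Unix-like machine. `RDD.pipe` streams each partition's records to an external command's stdin, one record per line, and returns the command's stdout lines as a new RDD. `PipeSketch`, `trainWithBinary` and the sample records are hypothetical, and `cat` is only a stand-in for the training binary, which the talk does not show.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PipeSketch {
  /** Sends every record of each partition through an external command, one record per line. */
  def trainWithBinary(records: RDD[String], command: Seq[String]): RDD[String] =
    records.pipe(command)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pipe-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical tab-separated "<ngram>\t<count>" records.
      val counts = sc.parallelize(Seq("how are you\t1", "how are they\t1"))
      // `cat` echoes its input; a real job would name the training binary and its flags here.
      trainWithBinary(counts, Seq("cat")).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Passing the command as a `Seq[String]` avoids relying on whitespace splitting of a single command string.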
Configuration
PipelineConfiguration
- where	is	the	input	data	?
- where	to	store	final	output	?
- spark	specific	configs:
"spark.dynamicAllocation.maxExecutors"
"spark.executor.memory"
"spark.memory.storageFraction"
…………
- list	of	ComponentConfiguration
……
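One way the configuration above could be modelled, as a sketch only. The class names `PipelineConfiguration` and `ComponentConfiguration` and the fields `order` and `largeShardThreshold` appear in the talk; every other field name, the `sparkSubmitArgs` helper, and all values below are hypothetical placeholders, not the settings used in production.

```scala
object ConfigSketch {
  /** One sub-model to train. `name` is a hypothetical field. */
  final case class ComponentConfiguration(name: String, order: Int)

  /** Where to read from, where to write to, Spark settings, and the components to run. */
  final case class PipelineConfiguration(
      inputTable: String,                 // where is the input data?
      outputTable: String,                // where to store the final output?
      largeShardThreshold: Long,
      sparkConfigs: Map[String, String],  // Spark-specific configs
      components: Seq[ComponentConfiguration]) {

    /** Renders the Spark settings as `--conf key=value` arguments for spark-submit. */
    def sparkSubmitArgs: Seq[String] =
      sparkConfigs.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
  }

  // Placeholder values for illustration only.
  val example: PipelineConfiguration = PipelineConfiguration(
    inputTable = "data_source",
    outputTable = "language_model",
    largeShardThreshold = 1000000L,
    sparkConfigs = Map(
      "spark.dynamicAllocation.maxExecutors" -> "100",
      "spark.executor.memory" -> "4g",
      "spark.memory.storageFraction" -> "0.5"),
    components = Seq(ComponentConfiguration("sub_model_1", order = 5)))
}
```

Keeping the whole pipeline in one typed value is what makes it easy to unit-test, in contrast to the templated SQL of the previous solution.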
Scalability	challenges
• Executors	lost	as	unable	to	heartbeat
• Shuffle	service	OOM
• Frequent	executor	GC
• Executor	OOM
• 2GB	limit	in	Spark	for	blocks
• Exceptions	while	reading	output	stream	of	pipe	process
Data	skew
Sub-model training job
• N-gram extraction and counting
• Estimation and pruning
• Normalize n-gram counts
How	are	you
How	are	they
Its	raining
How	are	we	going	
When	are	we	going
You	are	awesome
They	are	working
…..
…..
Training	dataset
<How	are	we	going>	:	1
….
<How	are	you>	:	1
<How	are	they>	:	1
….
<How	are>	:	4	
<You	are>	:	1
<Its	raining>	:	1
….
<are>	:	6
<you>	:	1
<How>	:	4
…..
Word	count
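The counting step above can be sketched in plain Scala as follows. This is a local, single-machine illustration with hypothetical names (`NgramCountSketch`, `countNgrams`), not the job from the talk: it extracts every n-gram up to a maximum order from each sentence and counts occurrences. Whitespace tokenization is an assumption.

```scala
object NgramCountSketch {
  /** Counts every n-gram of order 1..maxOrder across all sentences. */
  def countNgrams(sentences: Seq[String], maxOrder: Int): Map[Seq[String], Long] = {
    val all: Seq[Seq[String]] = for {
      sentence <- sentences
      words = sentence.trim.split("\\s+").filter(_.nonEmpty).toList
      n <- 1 to maxOrder
      // `sliding` yields one short window when the sentence has fewer than n words; drop it.
      ngram <- words.sliding(n).toList if ngram.length == n
    } yield ngram
    all.groupBy(identity).map { case (ngram, occurrences) => ngram -> occurrences.size.toLong }
  }
}
```

On the four sentences "How are you", "How are they", "How are we going" and "When are we going", the 3-gram "are we going" gets a count of 2, as on the slide. The other counts on the slide come from the full training dataset, which is elided.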
<How	are	we	going>	:	1
<are	we	going>	:	2
<we	going>	:	2
<going>	:	1
<When	are	we	going>	:	1
<Its	raining>	:	1
<You	are	awesome>	:	1
…..
…..
Word	count
Partition	based	on	
2-word	suffix
<How	are	we	going>	:	1
<are	we	going>	:	2
<we	going>	:	2
<When	are	we	going>	:	1
…..
<Its	raining>	:	1
<You	are	awesome>	:	1
…..
Word	count
<How	are	we	going>	:	1
<are	we	going>	:	2
<we	going>	:	2
<going>	:	1
<When	are	we	going>	:	1
<Its	raining>	:	1
<You	are	awesome>	:	1
…..
….. …..
…..
<are>	:	6
<How>	:	4	
<you>	:	1
<doing>	:	1
<going>	:	1
<awesome>	:	1
<working>	:	1
…..
…..
Frequency of every word: 0’th shard
shards 1 to (n-1)
The 0-shard (has the frequency of every word) is shipped to all the nodes
N-grams with the same 2-word suffix fall in the same shard
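A sketch of suffix-based sharding as described on this slide, in plain Scala. It is not the `PartitionerForNgram` class from the talk, whose implementation is not shown; `SuffixShardSketch`, `shardFor` and the use of `hashCode` are assumptions. N-grams shorter than the suffix length go to shard 0, and everything else is hashed on its last `suffixWords` words into shards 1 to (n-1).

```scala
object SuffixShardSketch {
  /**
   * Shard id in [0, numShards). Shard 0 holds n-grams with fewer than
   * `suffixWords` words; the rest are keyed on their last `suffixWords` words.
   */
  def shardFor(ngram: Seq[String], numShards: Int, suffixWords: Int): Int = {
    require(numShards >= 2, "need shard 0 plus at least one regular shard")
    if (ngram.length < suffixWords) 0
    else {
      val key = ngram.takeRight(suffixWords).mkString(" ")
      // floorMod keeps the result non-negative even when hashCode is negative.
      1 + java.lang.Math.floorMod(key.hashCode, numShards - 1)
    }
  }
}
```

With `suffixWords = 2`, "How are we going", "are we going", "we going" and "When are we going" all share the key "we going" and land in the same shard, while the single word "going" goes to shard 0.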
Distribution	of	shards	(1-word	sharding)
Skewed shards due to data from frequent phrases, e.g. “how to ..”, “do you ..”
shards	1	to	(n-1)
Distribution	of	shards	(1-word	sharding)
shards	1	to	(n-1)
0-shard has single-word frequencies and 2-word frequencies as well
Distribution	of	shards	(2-word	sharding)
Solution:	Progressive	sharding
First	iteration
Ignore	skewed	shards
def findLargeShardIds(sc: SparkContext, threshold: Long, …..): Set[Int] = {
  // Each line of the shard-counts file is "<shardId>\t<count>".
  val shardSizesRDD = sc.textFile(shardCountsFile).map { line =>
    val Array(indexStr, countStr) = line.split('\t')
    (indexStr.toInt, countStr.toLong)
  }
  // Keep only the ids of the shards whose size exceeds the threshold.
  shardSizesRDD
    .filter { case (_, count) => count > threshold }
    .map(_._1)
    .collect()
    .toSet
}
First	iteration
Process	all	the	non-skewed	shards
Second	iteration
Effective	0-shard	is	small
Re-shard	left	over	with	2-words	history
Second	iteration
Discard	bigger	shards
Second	iteration
Process	all	the	non-skewed	shards
Continue	with	further	iterations	….
var iterationId = 0
var largeShardIds = Set.empty[Int]
do {
  val currentCounts: RDD[(String, Long)] = allCounts(iterationId - 1)
  val partitioner = new PartitionerForNgram(numShards, iterationId)
  val shardCountsFile = s"${shard_sizes}_$iterationId"
  // Count the n-grams that land in each shard and save them as "<shardId>\t<count>" lines.
  currentCounts
    .map(ngram => (partitioner.getPartition(ngram._1), 1L))
    .reduceByKey(_ + _)
    .map { case (shardId, count) => s"$shardId\t$count" }
    .saveAsTextFile(shardCountsFile)
  largeShardIds = findLargeShardIds(sc, config.largeShardThreshold, shardCountsFile)
  // Process the non-skewed shards; the skewed shard ids are passed along so they can be ignored.
  trainer.trainedModel(currentCounts, component, largeShardIds)
    .saveAsObjectFile(s"${component.order}_$iterationId")
  iterationId += 1
} while (largeShardIds.nonEmpty)
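The loop above can be simulated locally to see why it terminates. The following plain-Scala sketch runs on an in-memory map instead of RDDs; the names, the suffix-based key, the `hashCode` sharding and the `maxKeyWords` cut-off are all assumptions made for illustration, not details from the talk. Each iteration measures shard sizes, processes the n-grams in non-skewed shards, and re-shards the leftover with a key one word longer.

```scala
object ProgressiveShardingSketch {
  type Counts = Map[Seq[String], Long]

  /** Shard keyed on the last `keyWords` words; shorter n-grams go to shard 0. */
  def shardFor(ngram: Seq[String], numShards: Int, keyWords: Int): Int =
    if (ngram.length < keyWords) 0
    else 1 + java.lang.Math.floorMod(
      ngram.takeRight(keyWords).mkString(" ").hashCode, numShards - 1)

  /** Returns the n-grams processed in each iteration, in order. */
  def run(counts: Counts, numShards: Int, threshold: Long, maxKeyWords: Int): Seq[Counts] = {
    require(numShards >= 2 && maxKeyWords >= 1)
    var remaining = counts
    var keyWords = 1
    val processed = Seq.newBuilder[Counts]
    while (remaining.nonEmpty) {
      val shardOf = (ngram: Seq[String]) => shardFor(ngram, numShards, keyWords)
      val sizes = remaining.keys.toSeq.groupBy(shardOf).map { case (id, ks) => id -> ks.size }
      // At the longest key nothing is treated as skewed, so the loop always ends.
      val large: Set[Int] =
        if (keyWords >= maxKeyWords) Set.empty[Int]
        else sizes.collect { case (id, size) if size > threshold => id }.toSet
      val (skewed, ok) = remaining.partition { case (ngram, _) => large(shardOf(ngram)) }
      processed += ok      // "process all the non-skewed shards"
      remaining = skewed   // "re-shard left over" with a longer key
      keyWords += 1
    }
    processed.result()
  }
}
```

Every n-gram is processed exactly once: either in the first iteration where its shard is under the threshold, or in the final iteration at `maxKeyWords`.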
Performance	evaluation
[Chart: Reserved CPU time (days), Hive vs Spark. Spark is 15x more efficient]
[Chart: Latency (hours), Hive vs Spark. Spark is 2.6x faster]
Upstream	contributions	to	pipe()
• [SPARK-13793] PipedRDD doesn't propagate exceptions while reading parent RDD
• [SPARK-15826] PipedRDD to allow configurable char encoding
• [SPARK-14542] PipeRDD should allow configurable buffer size for the stdin writer
• [SPARK-14110] PipedRDD to print the command ran on non zero exit
Questions	?

Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil