Spark @ Bloomberg:
Dynamic Composable Analytics
Partha Nageswaran
Sudarshan Kadambi
BLOOMBERG L.P.
Bloomberg Spark Server
Spark Serverization at Bloomberg has culminated in the creation of the Bloomberg Spark Server.
[Diagram: the Bloomberg Spark Server runs in a single JVM around a (persistent) Spark Context and contains a Request Handler, Request Processors (including a Declarative Query Request Processor), the Function Transform Registry (FTR), the Managed DataFrame Registry and the Ingestion Manager.]
Spark Serverization – Motivation
• Stand-alone Spark Apps on isolated clusters pose challenges:
– Redundancy in:
» Crafting and managing of RDDs/DFs
» Coding of the same or similar types of transforms/actions
– Management of clusters, replication of data, etc.
– Analytics are confined to specific content sets, making Cross-Asset Analytics much harder
– Need to handle real-time ingestion in each App
[Diagram: Spark Apps on separate Spark Clusters, alongside Spark Apps served by a Spark Server on one Spark Cluster.]
Dynamic Composable Analytics
• Compositional Analytics are commonplace in the Financial Sector
Decile Rank the 14-day Relative Strength Index of Active Equity Stocks:
DECILE(
  RSI(
    Price,
    14,
    ['IBM US Equity', 'VOD LN Equity', … ]
  )
)
• Price is data abstracted as a Spark Data Frame
• RSI and DECILE are building-block analytics, expressible as Spark transforms and actions
Dynamic Composable Analytics
• Another use case may want to compose Percentile with RSI
Percentile Rank the 14-day Relative Strength Index of Active Equity Stocks:
PERCENTILE(
  RSI(
    Price,
    14,
    ['IBM US Equity', 'VOD LN Equity', … ]
  )
)
• Or Percentile with ROC, etc. The compositions may be arbitrarily complex
Dynamic Composable Analytics
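import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Expects columns id, date and value; returns one RSI value per id.
def RSI(df: DataFrame, period: Int = 14): DataFrame = {
  // Smoothed moving average weight of the i-th most recent row: (period-1)^(i-1) / period^i
  val smmaCoeff = udf((i: Double) => scala.math.pow(period - 1, i - 1) / scala.math.pow(period, i))
  // RSI = 100 - 100 / (1 + RS) with RS = n / d, written without the intermediate division
  val rsi_from_rs = udf((n: Double, d: Double) => 100 - 100 * d / (d + n))
  val rsi_window = Window.partitionBy(col("id")).orderBy(col("date").desc)
  df.withColumn("weight", smmaCoeff(row_number().over(rsi_window)))
    .withColumn("diff", col("value") - lead(col("value"), 1).over(rsi_window)) // change vs. previous date
    .withColumn("U", when(col("diff") > 0, col("diff")).otherwise(0))          // upward moves
    .withColumn("D", when(col("diff") < 0, abs(col("diff"))).otherwise(0))     // downward moves
    .groupBy(col("id"))
    .agg(rsi_from_rs(sum(col("U") * col("weight")), sum(col("D") * col("weight"))) as "value")
}

// Replaces value with its decile rank (1 = highest values).
def Decile(df: DataFrame): DataFrame = {
  df.withColumn("value", ntile(10).over(Window.orderBy(col("value").desc)))
}
Ack: Andrew Foster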
Function Transform Registry
• Maintain a Registry of Analytic functions (FTR), with functions expressed as Parametrized Spark Transforms and Actions
• Functions can compose other functions, along with additional transforms, within the Registry
• Registry supports 'bind' and 'lookup' operations
Function Transform Registry (FTR)
FUNCTIONS     SPARK IMPL.
Decile        …
Percentile    …
…             …
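The 'bind' and 'lookup' operations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Table` (a plain Scala `Map` from security id to value) stands in for a Spark DataFrame, and the names `FunctionTransformRegistry`, `ntile`, `NEGATE` and `DECILE_OF_NEGATE` are hypothetical, not part of the Bloomberg Spark Server.

```scala
object FtrSketch {
  // Stand-in for a DataFrame with columns (id, value); an assumption of this sketch.
  type Table = Map[String, Double]
  type Transform = Table => Table

  // Minimal registry offering the two operations named on the slide.
  final class FunctionTransformRegistry {
    private var functions = Map.empty[String, Transform]
    def bind(name: String, f: Transform): Unit = functions += (name -> f)
    def lookup(name: String): Option[Transform] = functions.get(name)
  }

  // Orders rows by value, highest first, and spreads them over `buckets` groups
  // the way SQL NTILE does: the first (n % buckets) groups get one extra row.
  def ntile(buckets: Int): Transform = table => {
    val ordered = table.toSeq.sortBy { case (_, v) => -v }
    val base = ordered.size / buckets
    val extra = ordered.size % buckets
    ordered.zipWithIndex.map { case ((id, _), i) =>
      val bucket =
        if (i < extra * (base + 1)) i / (base + 1) + 1
        else extra + (i - extra * (base + 1)) / base + 1
      id -> bucket.toDouble
    }.toMap
  }

  val registry = new FunctionTransformRegistry
  registry.bind("DECILE", ntile(10))
  registry.bind("NEGATE", _.map { case (id, v) => id -> -v })

  // A function composed from other registry entries and bound back into the registry.
  for (inner <- registry.lookup("NEGATE"); outer <- registry.lookup("DECILE"))
    registry.bind("DECILE_OF_NEGATE", inner andThen outer)
}
```

Because a composed function has the same type as its parts, it can be bound under its own name and reused in further compositions.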
Bloomberg Spark Server
[Diagram: the Bloomberg Spark Server JVM with the (persistent) Spark Context, the Request Handler and the Function Transform Registry (FTR).]
Request Processor
• Request Processors (RPs) are Spark applications that orchestrate the composition of analytics on Data Frames
– RPs comply with a specification that allows them to be hosted by the Bloomberg Spark Server
– Each request (such as: compute the Decile Rank of the RSI) is handled by a Request Processor that looks up functions from the FTR, composes them and applies them to Data Frames
[Diagram: the Request Handler dispatches to Request Processors, including a Declarative Query Request Processor, which use the FTR.]
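A rough sketch of what a Request Processor does with a request such as DECILE(RSI(Price)): resolve each function name in the FTR, compose the functions innermost first, and apply the result to the input. The `Request` shape, the `process` signature and the use of a plain `Map` for both the FTR and the data are assumptions of this sketch, not the specification that real RPs comply with.

```scala
object RequestProcessorSketch {
  // Stand-in for a DataFrame with columns (id, value); an assumption of this sketch.
  type Table = Map[String, Double]
  type Transform = Table => Table

  // Hypothetical request: function names listed outermost first, e.g. List("DECILE", "RSI").
  final case class Request(functions: List[String], input: Table)

  // Looks up every function in the FTR (modelled as a Map), composes them and applies them.
  def process(ftr: Map[String, Transform], request: Request): Either[String, Table] = {
    val missing = request.functions.filterNot(ftr.contains)
    if (missing.nonEmpty) {
      val names = missing.mkString(", ")
      Left("unknown functions: " + names)
    } else {
      // Reverse so that the innermost function runs first.
      val composed = Function.chain(request.functions.reverse.map(ftr))
      Right(composed(request.input))
    }
  }
}
```

Returning `Either` keeps a failed lookup visible to the caller instead of throwing inside the shared server process; that is a design choice of the sketch only.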
Bloomberg Spark Server
[Diagram: the Bloomberg Spark Server JVM with the (persistent) Spark Context, the Request Handler, the Function Transform Registry (FTR) and Request Processors, including a Declarative Query Request Processor.]
Managed Data Frames
• Besides locating functions from the FTR, Request Processors have to pass in Data Frames to these functions as inputs
• Rather than instantiating Data Frames, RPs look up Data Frames from a Data Frames Registry
– Such Data Frames are called Managed Data Frames (MDFs)
– The Registry that manages these Data Frames is the Managed Data Frame Registry (MDF Registry)
Introducing Managed DataFrames (MDFs)
• A Managed DataFrame (MDF) is a named DataFrame, optionally combined with Execution Metadata
– MDFs can be located by name OR by any Column Name defined in the Schema of the corresponding DF
• Execution Metadata includes:
– Data Distribution metadata, which captures information about the data depth, histogram information, etc.
– E.g.: one Managed DataFrame for pricing of stocks representing 2 years of historical data, and another representing 30 years of historical data
[Diagram: two MDFs, each wrapping a Price DF <ID, Price>. Name: Shallow PriceMDF, Execution Metadata: 2 Yr Price History. Name: Deep PriceMDF, Execution Metadata: 30 Yr Price History.]
Managed DataFrames
– Data Derivation metadata, which are mathematical expressions that define how additional columns can be synthesized from existing columns in the schema
– E.g.: adjPrice is a derived Column, defined in terms of the base Price column
– In essence, an MDF with data derivation metadata has a Schema that is the union of the columns of the contained DF and the derived columns
[Diagram: two MDFs, each wrapping a Price DF <ID, Price>. Name: Shallow PriceDF, Execution Metadata: 2 Yr Price History, adjPrice = Price - 3% of Price. Name: Deep PriceDF, Execution Metadata: 30 Yr Price History, adjPrice = Price - 1.75% of Price.]
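To make the idea of data derivation metadata concrete, here is a small sketch of an MDF whose schema is the union of the contained columns and the derived columns. The case class, its field names and the representation of a row as a `Map` are assumptions for illustration; a real MDF wraps a Spark DataFrame.

```scala
object ManagedDataFrameSketch {
  // One row as column name -> value; a stand-in for a DataFrame row.
  type Row = Map[String, Double]

  // Hypothetical MDF: a named data set plus derivation metadata
  // (derived column name -> expression over the base columns of a row).
  final case class ManagedDataFrame(
      name: String,
      baseColumns: Set[String],
      rows: Seq[Row],
      derivations: Map[String, Row => Double] = Map.empty) {

    // Schema = columns of the contained data set plus the derived columns.
    def schema: Set[String] = baseColumns ++ derivations.keySet

    // Derived columns are synthesized on demand; base columns are read directly.
    // Asking for a column outside the schema throws NoSuchElementException.
    def column(col: String): Seq[Double] =
      derivations.get(col) match {
        case Some(expr) => rows.map(expr)
        case None       => rows.map(_(col))
      }
  }

  // adjPrice = Price - 3% of Price, as in the Shallow PriceDF example; the rows are made up.
  val shallow = ManagedDataFrame(
    name = "Shallow PriceDF",
    baseColumns = Set("ID", "Price"),
    rows = Seq(Map("ID" -> 1.0, "Price" -> 100.0), Map("ID" -> 2.0, "Price" -> 200.0)),
    derivations = Map("adjPrice" -> ((r: Row) => r("Price") - 0.03 * r("Price"))))
}
```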
The MDF Registry
• The MDF Registry within the Bloomberg Spark Server provides support for:
– Binding MDFs by Name
– Looking up MDFs by Name
– Looking up MDFs by a Column Name (an element of the MDF Schema), etc.
• The MDF Registry maintains a 'table' that associates the Name of the MDF with the DF reference and the Columns in the DF
MDF Registry
Name              Columns           DF Ref.   Meta Data
Shallow Price DF  Price, adjPrice   …         …
Deep Price DF     Price, adjPrice   …         …
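The registry 'table' can be sketched as a map keyed by MDF name, with a scan over the column sets for lookups by column. The `Entry` and `MdfRegistry` names, the generic `DF` type parameter and the string-valued metadata are assumptions of this sketch.

```scala
object MdfRegistrySketch {
  // Hypothetical registry row: name, columns, DF reference (left abstract) and metadata.
  final case class Entry[DF](name: String, columns: Set[String], df: DF, metadata: Map[String, String])

  final class MdfRegistry[DF] {
    private var table = Map.empty[String, Entry[DF]]

    // Binding under an existing name replaces the previous entry.
    def bind(entry: Entry[DF]): Unit = table += (entry.name -> entry)

    def lookupByName(name: String): Option[Entry[DF]] = table.get(name)

    // Several MDFs may expose the same column, so all matches are returned, sorted by name.
    def lookupByColumn(column: String): Seq[Entry[DF]] =
      table.values.filter(_.columns.contains(column)).toSeq.sortBy(_.name)
  }
}
```

A lookup by column can return more than one MDF (both Shallow and Deep price data expose adjPrice), so a caller would need some further rule, for example the execution metadata, to choose between them.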
Bloomberg Spark Server
[Diagram: the Bloomberg Spark Server JVM with the (persistent) Spark Context, the Request Handler, the Function Transform Registry (FTR), Request Processors (including a Declarative Query Request Processor) and the Managed DataFrame Registry.]
Flow of Requests
• Request Processors within the Spark Server orchestrate analytics
– These RPs have access to the MDF Registry and the FTR
– They are responsible for composing transforms and actions on one or more MDFs
– They may dynamically bind additional MDFs (materialized or otherwise) for use by other Apps
[Diagram: the Request Handler passes a request to a Request Processor, which looks up MDFs in the MDF Registry and functions in the FTR, applies the functions to the MDFs, decorates them with transforms, and collects the result.]
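Bloomberg Spark Server
[Diagram: the Bloomberg Spark Server with the Spark Context, the Request Handler, Request Processors (including a Declarative Query Request Processor), the Function Transform Registry (FTR) holding RSI, …, and the MDF Registry holding several MDFs; a Request Processor uses an MDF.]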
Bloomberg Spark Server
[Diagram: as on the previous slide, with the Ingestion Manager added; MDF1 and MDF2 in the MDF Registry are marked with steps 1 and 2.]
Schema Repository
• Enterprise-wide data pipeline
• External (to Spark) schema repository and service
• Enables MDF lookup by a dataset schema element
• Analytic expressions can now be composed over data elements
Execution Metadata
• Dataset Source Connection Identifiers
• Backing Stores
• Real-time Topics
• Storage Level & Refresh Rate
• Subset Predicate, etc.
Ad-hoc Cross-Domain Analytics
• Registration of pre-materialized DataFrames
• Collaborative analytics between application workflows
• Dynamic creation of Managed DataFrames
• Spark Servers have data pertaining to a single domain materialized
• Ad-hoc cross-domain analytics requires the capability to synthesize MDFs on demand
Content Subsetting
• High-value data sub-setted within Spark
• Reduces the cost of querying the external datastore
• Specified as a filter predicate at time of registration
• E.g. member companies of popular indices [Dow 30, S&P 500, …] have records placed within Spark
Content Subsetting
• Seamless unification of data in Spark (DFsubset) and backing store (DFsubset’)
(DFsubset ∪ DFsubset’).filter(query) = DFsubset.filter(query) ∪ DFsubset’.filter(query)
• Dataset owners are provided knobs for cost vs. performance
• LRU-cache-like mechanism planned for the future
• Makes sense as a capability native to Spark DataFrames
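The identity above says that a query filter distributes over the union of the in-Spark subset and the remainder in the backing store, so each side can be queried separately and the results merged. A small sketch with plain Scala collections; the record type, the security ids, the index memberships and the prices are all made up for illustration.

```scala
object SubsetUnionSketch {
  final case class Record(id: String, index: String, price: Double)

  // Hypothetical subset predicate fixed at registration: members of popular indices live in Spark.
  val subsetPredicate: Record => Boolean = r => Set("Dow 30", "S&P 500")(r.index)

  // Made-up universe of records.
  val universe = Seq(
    Record("SEC1", "Dow 30", 150.0),
    Record("SEC2", "Other", 2.0),
    Record("SEC3", "S&P 500", 100.0),
    Record("SEC4", "Other", 75.0))

  // DFsubset holds what matches the predicate; DFsubset' is what remains in the backing store.
  val (inSpark, inBackingStore) = universe.partition(subsetPredicate)

  // Right-hand side of the identity: filter each side, then union.
  def queryBoth(query: Record => Boolean): Seq[Record] =
    inSpark.filter(query) ++ inBackingStore.filter(query)

  // Left-hand side of the identity: union first, then filter.
  def queryUnion(query: Record => Boolean): Seq[Record] =
    (inSpark ++ inBackingStore).filter(query)
}
```

When the query only touches records covered by the subset predicate, the backing-store side contributes nothing, which is where the saving in external query cost would come from.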
Ingestion: Periodic Refresh
• Periodic data pull into Spark from the backing store
• Subset criteria applied during data retrieval
• Used when a dataset has a backing store, but no real-time update stream that we can tap into
• Dataset owners have control over the storage level of the DataFrames created within a given MDF
Ingestion: Stream Reconciliation
• Analytics need low latency with respect to queries, but also with respect to data freshness
• Since data is being sub-setted within Spark, the subset needs to be kept up to date
• Datasets are published to different Kafka topics
• 1:1 mapping between datasets, topics and DStreams
Ingestion: Stream Reconciliation
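[Diagram: for the MDF PriceHistory, DFsubset is loaded from the Backing Store; the Real-Time Stream is processed (Avro Deserialize, Subset Predicate) into updates U1, U2, U3 … UN, which update state S1, S2, S3 … SN; the states are converted to a sequence of DFs, up to DFN.]
Similar in intent to Structured Streaming, to be introduced in Spark 2.0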
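The reconciliation step can be sketched as a fold over micro-batches: each batch of updates is filtered by the subset predicate and upserted into the current state, and every resulting state is one generation. The `Update` and `Snapshot` types are assumptions of this sketch; Avro deserialization, Kafka and DStreams are deliberately left out.

```scala
object StreamReconciliationSketch {
  final case class Update(id: String, price: Double)

  // Stand-in for one generation of the in-Spark subset: id -> latest price.
  type Snapshot = Map[String, Double]

  // Applies one micro-batch: drop updates outside the subset, then upsert by id.
  def reconcile(inSubset: String => Boolean)(state: Snapshot, batch: Seq[Update]): Snapshot =
    batch.filter(u => inSubset(u.id)).foldLeft(state)((s, u) => s.updated(u.id, u.price))

  // Returns the states S1..SN produced by applying batches U1..UN to the initial subset.
  def generations(initial: Snapshot, inSubset: String => Boolean, batches: Seq[Seq[Update]]): Seq[Snapshot] =
    batches.scanLeft(initial)((s, b) => reconcile(inSubset)(s, b)).tail
}
```

Since each state is an immutable value, a query that started on an older generation keeps seeing consistent data while newer generations are being produced.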
Ingestion: Data Transformation
• Data in backing stores may need representation transforms before being used in queries
• Data in multiple tables can be denormalized into a single DF within Spark
• Or, quickly see the effect of different storage representations on performance, without changing the representation in the backing store
• Implemented via user transforms associated with a given MDF
Spark Server: Memory Management
• An MDF contains multiple generations of DFs, being generated and destroyed
• Multiple generations are operated upon by RPs at a given point in time
• Reference counting keeps track of which DFs are being used and by whom
• Long-running queries are aborted for forced reclamation
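Reference counting over DataFrame generations might look roughly like the sketch below. Generations are identified by an `Int`, and the `GenerationTracker` class with its `register`, `acquire`, `release`, `retire` and `reclaimable` operations is a hypothetical API invented for illustration; the forced abort of long-running queries is not modelled.

```scala
object GenerationRefCountSketch {
  // Tracks how many Request Processors currently hold each DataFrame generation.
  final class GenerationTracker {
    private var refCounts = Map.empty[Int, Int]
    private var retired = Set.empty[Int]

    // Makes a new generation available to readers.
    def register(generation: Int): Unit = synchronized {
      refCounts += (generation -> 0)
    }

    // Called by an RP before it starts working on a generation.
    def acquire(generation: Int): Unit = synchronized {
      require(refCounts.contains(generation) && !retired(generation),
        s"generation $generation is not available")
      refCounts += (generation -> (refCounts(generation) + 1))
    }

    // Called by an RP when it is done with a generation.
    def release(generation: Int): Unit = synchronized {
      require(refCounts.getOrElse(generation, 0) > 0, s"generation $generation is not held")
      refCounts += (generation -> (refCounts(generation) - 1))
    }

    // Marks a generation as superseded; no new readers may acquire it.
    def retire(generation: Int): Unit = synchronized {
      retired += generation
    }

    // Retired generations with no remaining readers can be dropped from memory.
    def reclaimable: Set[Int] = synchronized {
      retired.filter(g => refCounts.getOrElse(g, 0) == 0)
    }
  }
}
```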
Query Consistency
• Multiple queries need to operate on the same snapshot of data
• How can this be achieved if the data is constantly changing underneath?
• Each DF within an MDF is associated with a time epoch
• Registry lookup with a reference time
• Time-align sub-setted DataFrames with data in the backing store
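A registry lookup with a reference time can be sketched as picking the latest generation whose epoch is not after that time. The `Generation` type, the use of a `Long` epoch and the `lookup` signature are assumptions of this sketch.

```scala
object EpochLookupSketch {
  // Hypothetical: each generation of an MDF carries the epoch at which it became current.
  final case class Generation[DF](epoch: Long, df: DF)

  // Latest generation whose epoch is at or before the reference time, if any.
  def lookup[DF](generations: Seq[Generation[DF]], referenceTime: Long): Option[Generation[DF]] =
    generations.filter(_.epoch <= referenceTime).sortBy(_.epoch).lastOption
}
```

Queries that pass the same reference time resolve to the same generation, which is what lets them operate on one snapshot even while newer generations arrive.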
Spark for Online Analytics
– High Availability of the Spark Driver
• High bootstrap cost of reconstructing the cluster and cached state
• Naïve HA models (such as multiple active clusters) surface query inconsistency
– High Availability of RDD Partitions
• With a subset or the universe cached, lost RDD partitions kill query performance
– Performance Consistency
• Performance is gated by the slowest executor
• High Availability and Low Tail Latency are closely related
– Interaction effects between low-latency queries and low-latency updates
• Little to no sandboxing between jobs sharing executor JVMs
First Bloomberg contribution: SPARK-15352
Spark Server Acknowledgements
Andrew Foster Joe Davey Shubham Chopra
Nimbus Goehausen Tracy Liang
THANK YOU.
pnageswaran@bloomberg.net
skadambi@bloomberg.net
