Presto/Accumulo
Lessons Learned
Adam Shook
Datacatessen
Abstract
The Presto-Accumulo connector has been in production for over 18 months. It has been successful overall, but initial design decisions and tech debt have caused some pain points along the way.
During this session, we'll briefly review the Accumulo connector for Presto and the use case that led to its development. We'll then discuss the pain points we have experienced with the connector, and the latest features and changes that improve query performance, ingestion, and ease of use.
Outline
• Presto-Accumulo Overview
• Use Case Review
• Lessons Learned and New Features
Presto-Accumulo Review
• Presto is open source and was built by Facebook
– MPP OLAP engine with pluggable storage
• ANSI SQL for NoSQL
• The connector aims to accelerate relational OLTP use cases by abstracting away common Accumulo design patterns
• Load data using SQL or Java
• Supports predicate pushdown via advanced indexes and metrics
• Queries range from milliseconds to seconds
• Available since Presto 0.153
– See https://github.com/bloomberg/presto for the latest features
Presto/Accumulo Workflow
[Diagram: Client → Coordinator → four Workers → Accumulo]
• The Coordinator leverages indexes and optimizations to gather the Ranges to scan
• Each Worker is given a subset of the Ranges to read from Accumulo in parallel via BatchScanners
• Workers pull data from Accumulo, converting it into Presto’s internal object model
• Accumulo’s job is done; Presto takes over to shuffle data as needed and complete the query
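The hand-off from the Coordinator to the Workers can be illustrated with a small sketch. It assumes the Ranges are dealt out round-robin, which the slides do not state; `Range` and `assign_ranges` are hypothetical names, and a plain tuple stands in for an Accumulo Range.

```python
from typing import List, Tuple

# Hypothetical stand-in for an Accumulo Range: (start_row, end_row).
Range = Tuple[str, str]


def assign_ranges(ranges: List[Range], num_workers: int) -> List[List[Range]]:
    """Deal ranges out round-robin so each worker gets a similar share.

    Round-robin is an assumption made for illustration only.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    assignments: List[List[Range]] = [[] for _ in range(num_workers)]
    for position, row_range in enumerate(ranges):
        assignments[position % num_workers].append(row_range)
    return assignments
```

Each inner list is what one Worker would then read in parallel, for example with a BatchScanner.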
Bloomberg Use Case
• Surveillance application for Compliance Officers to review events
• Web application uses JDBC to execute SQL queries against Presto
• Ingest is done via a Storm topology using the PrestoBatchWriter
• Use case relies heavily on the event time index to retrieve recent events
• Query performance ranges from 10 milliseconds to ~10 seconds for the most common queries, with table sizes in TBs and thousands of tablets
• Presto deployed on Mesos
LESSONS LEARNED
And the stuff done to make it better
Dropping/Updating Data
• No trivial way to age off or update data using the PrestoBatchWriter
– Delete mutations were ignored
– Updating was manual, as index entries needed to be dropped and metrics decremented
– Unable to use Accumulo’s age-off iterator (AgeOffFilter)
Dropping/Updating Data
• Led to API improvements within the PrestoBatchWriter to delete and update
– Supports delete mutations
– Properly handles deleting index entries, decrementing metrics entries, and creating new index/metric entries
– Explicit API calls to change the values of columns
• Want to implement DELETE so we can drop data via SQL
• LL (lesson learned): Need a clean API to delete stuff
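The bookkeeping an update or delete implies can be sketched with in-memory dictionaries. This is not the PrestoBatchWriter API; the table layouts and the names `put`, `delete`, and `_unindex` are assumptions chosen to show why the old index entry must be removed and its metric decremented before the new value is indexed.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# In-memory stand-ins for the data, index, and metrics tables (hypothetical layout).
data: Dict[str, Dict[str, str]] = {}                        # row ID -> {column: value}
index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)  # (column, value) -> row IDs
metrics: Dict[Tuple[str, str], int] = defaultdict(int)      # (column, value) -> row count


def _unindex(row_id: str, column: str, value: str) -> None:
    """Drop one index entry and decrement the matching metric."""
    key = (column, value)
    index[key].discard(row_id)
    metrics[key] -= 1
    if metrics[key] <= 0:
        index.pop(key, None)
        metrics.pop(key, None)


def put(row_id: str, column: str, value: str) -> None:
    """Insert or update one column, keeping index and metrics in step."""
    old = data.setdefault(row_id, {}).get(column)
    if old == value:
        return
    if old is not None:
        _unindex(row_id, column, old)
    data[row_id][column] = value
    index[(column, value)].add(row_id)
    metrics[(column, value)] += 1


def delete(row_id: str) -> None:
    """Delete a row together with all of its index and metric entries."""
    for column, value in data.pop(row_id, {}).items():
        _unindex(row_id, column, value)
```

Skipping the `_unindex` step on update is what leaves stale index entries and inflated metrics behind.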
Arbitrary Row Batching
• Connector packed a fixed 50,000 row IDs into each split
– A split is the unit of parallelism in Presto
– Resulted in a variable number of splits, frequently creating too few splits and not leveraging the distributed nature of Presto
– Caused problems with concurrent queries over larger data sets
Arbitrary Row Batching
• Created a formula for determining the number of rows per split (with min/max)
– Number of splits is a function of the desired number of parallel queries and the number of concurrent splits per worker
  • r: number of rows to be scanned
  • s: splits per worker
  • w: number of workers
  • rows per split = r / s / w
• LL: Need to properly generate splits for maximum parallelization
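As a worked example, the sketch below reads r / s / w as the number of rows packed into each split, clamped to a minimum and maximum. That reading is an interpretation of the slide, and the default bounds (`min_rows`, `max_rows`) are hypothetical values, not the connector's settings.

```python
def rows_per_split(r: int, s: int, w: int,
                   min_rows: int = 1_000, max_rows: int = 50_000) -> int:
    """Rows per split = r / s / w, clamped to [min_rows, max_rows].

    r: number of rows to be scanned
    s: splits per worker
    w: number of workers
    The default bounds are assumptions for illustration.
    """
    if s < 1 or w < 1:
        raise ValueError("s and w must be at least 1")
    return max(min_rows, min(max_rows, r // (s * w)))


def num_splits(r: int, s: int, w: int,
               min_rows: int = 1_000, max_rows: int = 50_000) -> int:
    """Number of splits needed to cover r rows at the computed batch size."""
    batch = rows_per_split(r, s, w, min_rows, max_rows)
    return -(-r // batch)  # ceiling division
```

With r = 1,000,000, s = 4, and w = 10 this gives 25,000 rows per split and 40 splits, one for every split slot in the cluster. At r = 10,000,000 the maximum applies, giving 50,000 rows per split and 200 splits.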
Bottleneck in Index Retrieval
• Connector would regularly spend several seconds fetching row IDs from the index
– Process was single-threaded and non-distributed
– Regularly retrieved more row IDs than necessary, because many were later filtered out by other predicates
Bottleneck in Index Retrieval
[Diagram: row IDs where user='adam'; row IDs where date='2017-10-16'; row IDs where user='adam' AND date='2017-10-16']
• Three new features/optimizations
– ThreadPool to fetch row IDs in parallel
– Composite indexes
– Distributing the index lookup to Workers
• LL: More parallelism and more indexes make faster queries
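The first two optimizations can be sketched together. The example assumes a toy in-memory index (`INDEX`, `COMPOSITE_INDEX`, and the row IDs are invented): each predicate is looked up on its own thread and the results are intersected, whereas a composite index answers the combined predicate with a single lookup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set, Tuple

# Hypothetical single-column index: (column, value) -> row IDs.
INDEX: Dict[Tuple[str, str], Set[str]] = {
    ("user", "adam"): {"r1", "r2", "r3"},
    ("date", "2017-10-16"): {"r2", "r3", "r4"},
}

# Hypothetical composite index on (user, date): one lookup, no intersection.
COMPOSITE_INDEX: Dict[Tuple[str, str], Set[str]] = {
    ("adam", "2017-10-16"): {"r2", "r3"},
}


def lookup(predicate: Tuple[str, str]) -> Set[str]:
    """Fetch the row IDs for one column = value predicate."""
    return set(INDEX.get(predicate, set()))


def fetch_row_ids(predicates: List[Tuple[str, str]]) -> Set[str]:
    """Look up every predicate in parallel, then intersect the results."""
    if not predicates:
        return set()
    with ThreadPoolExecutor(max_workers=min(8, len(predicates))) as pool:
        results = list(pool.map(lookup, predicates))
    return set.intersection(*results)
```

Here `fetch_row_ids([("user", "adam"), ("date", "2017-10-16")])` fetches six row IDs to return two, while the composite lookup fetches only the two that match.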
[Diagram: distributing the index lookup. From Coordinator: the range between 'alice' and 'wendy'. To Worker: the sub-ranges between 'alice' and 'erin', 'erin' and 'olivia', and 'oscar' and 'wendy']
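One way to picture the distributed lookup is to cut the Coordinator's index range into sub-ranges that Workers can scan independently. The sketch below assumes the cuts are made at known split points (for instance tablet boundaries); the slides do not say how the sub-ranges are chosen, and `split_range` is a hypothetical helper.

```python
from typing import List, Tuple


def split_range(start: str, end: str, split_points: List[str]) -> List[Tuple[str, str]]:
    """Cut [start, end] at every split point that falls strictly inside it."""
    if start > end:
        raise ValueError("start must not sort after end")
    inside = sorted(p for p in split_points if start < p < end)
    bounds = [start] + inside + [end]
    return list(zip(bounds, bounds[1:]))
```

For the range 'alice' to 'wendy' with split points 'erin' and 'olivia', this yields three contiguous sub-ranges, each of which could be handed to a different Worker.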
No Query History
• Presto’s Coordinator regularly purges past queries, and they are lost when the Coordinator restarts
– Presto released an EventListener API that provides metrics and metadata about a query to all implementors
No Query History
• Implemented an AccumuloEventListener that archives all queries in a Presto-Accumulo table
– Table is queryable via SQL
– Very helpful in generating usage reports
– History is persisted between restarts
• LL: Need proper storage of query history
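The archiving pattern can be sketched without Presto or Accumulo. In the example below SQLite stands in for the Presto-Accumulo table, and `QueryCompletedEvent` and `ArchivingListener` are hypothetical, trimmed-down stand-ins rather than the real EventListener types. The point is only that a listener writes each completed query to a store that can later be queried with SQL.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class QueryCompletedEvent:
    """Hypothetical stand-in for the event a listener receives."""
    query_id: str
    user_name: str
    query_text: str
    elapsed_ms: int


class ArchivingListener:
    """Archives every completed query in a SQL-queryable table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS query_archive ("
            "query_id TEXT PRIMARY KEY, user_name TEXT, "
            "query_text TEXT, elapsed_ms INTEGER)"
        )

    def query_completed(self, event: QueryCompletedEvent) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO query_archive VALUES (?, ?, ?, ?)",
            (event.query_id, event.user_name, event.query_text, event.elapsed_ms),
        )
        self.conn.commit()
```

A usage report is then an ordinary query, for example `SELECT user_name, COUNT(*) FROM query_archive GROUP BY user_name`. Pointing the connection at a file instead of `:memory:` keeps the history across restarts.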
No Visibility or Timestamps
• Scope of the use case expanded to require information within the full visibility label
– No way to access this information from a Presto table’s schema
No Visibility or Timestamps
• Added support for hidden visibility and timestamp columns
– Don’t clutter up the forward-facing DDL
– Get them for ‘free’ for all non-row ID columns (don’t have to explicitly define them)
– Available via SELECT
  • <column_name>_vis
  • <column_name>_ts
• LL: Need to expose visibility and timestamp information
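The naming convention is easy to show in code. The sketch below only derives the hidden column names from a table's declared columns; the table and column names are invented, and the helpers (`hidden_columns`, `select_with_hidden`) are hypothetical rather than part of the connector.

```python
from typing import List


def hidden_columns(columns: List[str], row_id_column: str) -> List[str]:
    """Derive the <column_name>_vis and <column_name>_ts names.

    The row ID column is skipped, since hidden columns exist only for
    non-row ID columns.
    """
    hidden: List[str] = []
    for name in columns:
        if name == row_id_column:
            continue
        hidden.extend([f"{name}_vis", f"{name}_ts"])
    return hidden


def select_with_hidden(table: str, columns: List[str], row_id_column: str) -> str:
    """Build a SELECT that names the hidden columns explicitly."""
    selected = columns + hidden_columns(columns, row_id_column)
    return f"SELECT {', '.join(selected)} FROM {table}"
```

For a table `events` with columns `id` (the row ID) and `user`, this builds `SELECT id, user, user_vis, user_ts FROM events`.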
Large Ad-hoc Queries
• “Big” scans (millions of rows) are causing problems
– Occupy scan threads on TabletServers
– Block other queries that need near real-time responses
Large Ad-hoc Queries
• Still not really solved today
– We reject queries that would scan more than 50 million rows
– Been brainstorming ideas such as reading RFiles
• LL: Don’t use the Accumulo connector for OLAP queries
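The rejection rule amounts to a guard that runs before any scan is issued. The sketch below assumes a row estimate is already available (for instance from index metrics, which is an assumption here); `check_scan_size` and `TooManyRowsError` are hypothetical names.

```python
MAX_ROWS_PER_QUERY = 50_000_000  # the limit quoted in the slides


class TooManyRowsError(Exception):
    """Raised when a query would scan more rows than the limit allows."""


def check_scan_size(estimated_rows: int, limit: int = MAX_ROWS_PER_QUERY) -> None:
    """Fail fast instead of tying up TabletServer scan threads."""
    if estimated_rows > limit:
        raise TooManyRowsError(
            f"query would scan {estimated_rows:,} rows; the limit is {limit:,}"
        )
```

Failing before the scan starts keeps one large ad-hoc query from delaying queries that need near real-time responses.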
Index Hotspotting
• Indexing 1.0 was a quick win
– Basic reverse index
– Caused all of the problems you want to avoid with indexes
– Low-cardinality columns caused very wide rows
– Timestamp columns are monotonically increasing
– Key distribution was all over the place due to different data types in the same table
– Deleting columns required configuring RegexFilters and compacting/merging the tables
Index Hotspotting
• Large refactoring effort for Indexing 2.0
– Split index table into one table per indexed column
– Merged metrics into index table (separate locality group)
– Added configurable IndexStorage methods that encode/decode index values to shard them or postfix random data
• LL: Support multiple index strategies based on the type of data being stored
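A sharding encoder of the kind described might look like the sketch below. It is not the connector's IndexStorage implementation: the shard count, the key layout, and the use of a CRC32 hash of the row ID are all assumptions, chosen to show how a monotonically increasing value such as a timestamp can be spread over several key prefixes.

```python
import zlib
from typing import List

NUM_SHARDS = 8  # hypothetical shard count (the two-digit prefix allows up to 100)


def encode(value: str, row_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Prefix the indexed value with a shard derived from the row ID."""
    shard = zlib.crc32(row_id.encode("utf-8")) % num_shards
    return f"{shard:02d}_{value}"


def decode(key: str) -> str:
    """Strip the shard prefix to recover the original value."""
    return key.split("_", 1)[1]


def scan_keys(value: str, num_shards: int = NUM_SHARDS) -> List[str]:
    """All keys a reader must look up to find every entry for a value."""
    return [f"{shard:02d}_{value}" for shard in range(num_shards)]
```

The trade-off is that writes no longer pile onto one tablet, but a reader has to check every shard for a given value.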
Feature Summary
• PrestoBatchWriter improvements (deletions)
• Smarter resource allocation
• Composite indexes
• Accumulo query archive (ROW types)
• Distributed index lookup
• Hidden visibility and timestamp columns
• Error on reading too much data
• Index hotspotting mitigations
Questions?
