Effective Testing of Apache Accumulo Iterators
Josh Elser
Accumulo Summit 2016
2016/10/11
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Engineer at Hortonworks, Member of the Apache Software Foundation
Top-Level Projects
• Apache Accumulo®
• Apache Calcite™
• Apache Commons™
• Apache HBase®
• Apache Phoenix™
ASF Incubator
• Apache Fluo™
• Apache Gossip™
• Apache Pirk™
• Apache Rya™
• Apache Slider™
These Apache project names are trademarks or registered trademarks of the Apache Software Foundation.
A Novel Feature of Apache Accumulo
→ SortedKeyValueIterator (SKVI or “Iterators”)
→ Computation offload
→ Reduced I/O
→ Rumored to be called “cool” by Jeff Dean
[Figure labels: Transformations, Server-Side, Predicate-Pushdown, Filters, Aggregations, Combiners, Versioning, Security]
Apache Accumulo Iterators
→ Column Slices (CfCqSliceFilter)
→ Basic Statistics (StatsCombiner)
→ Value/Array Concatenation (Summing[Array]Combiner)
→ Aggregations (WholeRowIterator, WholeColumnFamilyIterator)
→ In-Row operations (AndIterator, OrIterator)
→ Filters (RegExFilter, GrepIterator, FirstEntryInRowIterator)
Reads
→ Clients request a Range of data
→ Key to Row to Tablet to TabletServer
→ Sorted, merged read of memory and files
→ Computation offload and RPC boost
[Figure labels: Tablet (Memory, RFiles), Iterators, Client]
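The "sorted, merged read of memory and files" on this slide can be pictured as a k-way merge over sources that are each already sorted. The sketch below is plain Java rather than Accumulo code: `MergedReadSketch`, `Cursor`, and `mergeSorted` are hypothetical names, and each list of strings merely stands in for the in-memory data or one RFile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a plain-Java k-way merge, not Accumulo's implementation.
class MergedReadSketch {

  // Tracks the current head of one sorted source plus the rest of that source.
  private static final class Cursor {
    String head;
    final Iterator<String> rest;

    Cursor(Iterator<String> source) {
      this.rest = source;
      this.head = source.next();
    }
  }

  // Merges several individually sorted sources into one sorted list of keys.
  static List<String> mergeSorted(List<List<String>> sources) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
    for (List<String> source : sources) {
      if (!source.isEmpty()) {
        heap.add(new Cursor(source.iterator()));
      }
    }
    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor smallest = heap.poll();
      merged.add(smallest.head);
      if (smallest.rest.hasNext()) {
        smallest.head = smallest.rest.next();
        heap.add(smallest); // re-insert under its new head
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> memory = Arrays.asList("row2", "row5");
    List<String> fileA = Arrays.asList("row1", "row4");
    List<String> fileB = Arrays.asList("row3");
    // prints [row1, row2, row3, row4, row5]
    System.out.println(mergeSorted(Arrays.asList(memory, fileA, fileB)));
  }
}
```

Because the merge only ever looks at the head of each source, a consumer placed on top of it sees a single sorted stream, which is the view an Iterator works against.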
Reads with Iterators
→ A poor man’s “VIEW”
→ Server-side transformation at query time
Raw                                          Transformed
Key                        Value             Key                     Value
3141592 siblings:brothers  Bobby,Steven      3141592 siblings:count  4
3141592 siblings:sisters   Sally,Francine
3141593 siblings:brothers  Frank             3141593 siblings:count  3
3141593 siblings:sisters   Amy,Loretta
3141594 siblings:brothers                    3141594 siblings:count  2
3141594 siblings:sisters   Rebecca,Savannah
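The transformation in the table above can be sketched in plain Java. This is not a real SortedKeyValueIterator: `SiblingCountSketch` and `countSiblings` are hypothetical names, keys are modeled as "row column" strings in a `TreeMap`, and the sketch only shows the per-row counting that such an Iterator would perform at query time.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models the query-time "view" without the Accumulo API.
class SiblingCountSketch {

  // Input keys are "<row> <column>" strings in sorted order; values are
  // comma-separated names. Output holds one "<row> siblings:count" entry per row.
  static Map<String, Integer> countSiblings(TreeMap<String, String> raw) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : raw.entrySet()) {
      String row = entry.getKey().split(" ")[0];
      String names = entry.getValue();
      int n = names.isEmpty() ? 0 : names.split(",").length;
      counts.merge(row + " siblings:count", n, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    TreeMap<String, String> raw = new TreeMap<>();
    raw.put("3141592 siblings:brothers", "Bobby,Steven");
    raw.put("3141592 siblings:sisters", "Sally,Francine");
    raw.put("3141593 siblings:brothers", "Frank");
    raw.put("3141593 siblings:sisters", "Amy,Loretta");
    raw.put("3141594 siblings:brothers", "");
    raw.put("3141594 siblings:sisters", "Rebecca,Savannah");
    // prints {3141592 siblings:count=4, 3141593 siblings:count=3, 3141594 siblings:count=2}
    System.out.println(countSiblings(raw));
  }
}
```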
Compactions
→ Bounds number of files and performance
→ Iterators provide a data optimization mechanism
[Figure labels: Tablet, RFiles (Before and After), Iterators]
Compactions with Iterators
→ Deferred aggregation
→ Rewrite application data in optimal form
Raw                                          Transformed
Key                        Value             Key                        Value
3141592 siblings:brothers  Bobby,Steven      3141592 siblings:brothers  …
                                             3141592 siblings:count     4
3141592 siblings:sisters   Sally,Francine    3141592 siblings:sisters   …
3141593 siblings:brothers  Frank             3141593 siblings:brothers  …
                                             3141593 siblings:count     3
3141593 siblings:sisters   Amy,Loretta       3141593 siblings:sisters   …
3141594 siblings:brothers                    3141594 siblings:brothers  …
                                             3141594 siblings:count     2
3141594 siblings:sisters   Rebecca,Savannah  3141594 siblings:sisters   …
Better for Everyone
→ Iterators are great
– Abstraction for system-level filters and optimizations
– Better performance for power users
→ Lots of things Iterators are not
– Triggers
– Hooks
– Coprocessors
– “Hammers”
→ Iterators do not generally replace
– Flink, Hive, Mesos, Presto, Storm, Spark, YARN, etc.
– They can in some cases
On Building an Iterator
→ The API is not particularly intuitive
→ Hard to create/support SKVIv2
→ Edge cases in production are hard to understand
→ Lots of things to not do in an Iterator
– Trial and error
→ Insight into production systems is difficult
Existing Testing Tools
Unit Testing
→ Good
– Fast
– Concise/Simple
– Given input, verify output
→ Bad
– Not end-to-end
– Not representative invocation
MiniAccumuloCluster (MAC) Testing
→ Good
– Same server execution as production
– Same client interaction as production
→ Bad
– Slow/Memory intensive
– Pedantic to write tests
– Might not catch production edge cases
– Impacted by environment
What’s the happy medium?
Iterator Testing Harness
→ Testing harness designed to capture common pitfalls
– ACCUMULO-626 in >=1.8.0
→ Complementary to unit tests and MAC tests
→ The good parts
– Fast
– Generalized/Reusable tests
– Extensible
→ The bad parts
– Not directly using TabletServer for invocation
– Subtle failures
Iterator Testing Harness
→ Testing an Iterator requires three things
– Input data
– Expected output
– Collection of test cases to run
→ Test cases found via reflection
– Common edge cases provided
– Easy to develop and run new test cases
→ JUnit4 integration
    // Input, expected output, and test cases are combined into JUnit4 parameters.
    @Parameters
    public static Object[][] data() {
      IteratorTestInput input = createIteratorInput();
      IteratorTestOutput expectedOutput = createIteratorOutput();
      List<IteratorTestCase> testCases = createTestCases();
      return BaseJUnit4IteratorTest.createParameters(input, expectedOutput, testCases);
    }
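JUnit4's Parameterized runner runs the test class once for each row returned by the `@Parameters` method. Assuming the harness builds one row per test case, each carrying the same input and expected output, the shape of that array can be sketched in plain Java. `ParameterSketch` and its String arguments are hypothetical stand-ins, not the harness's real classes or its actual `createParameters` implementation.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: models how one parameter row per test case could be built.
class ParameterSketch {

  static Object[][] createParameters(String input, String expectedOutput,
      List<String> testCases) {
    Object[][] rows = new Object[testCases.size()][];
    for (int i = 0; i < testCases.size(); i++) {
      // Every row shares the input and expected output; only the test case varies.
      rows[i] = new Object[] {input, expectedOutput, testCases.get(i)};
    }
    return rows;
  }

  public static void main(String[] args) {
    Object[][] rows = createParameters("input", "expected",
        Arrays.asList("instantiation", "deepCopy", "reSeek"));
    for (Object[] row : rows) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Under that assumption, adding a test case to the collection adds one more run against the same data without touching the test class itself.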
Example Test Cases
→ Iterator Instantiation
– Does the Iterator have a visible no-args constructor?
→ “DeepCopy” safety
– Can a “deepCopy()” of an Iterator be used like the original?
→ Stateless “hasTop()”
– Do multiple invocations of “hasTop()” cause incorrect results/errors?
→ Re-seek()’ing
– Accumulo will re-instantiate scan sessions and use new Ranges
– Does the Iterator still return correct results in this case?
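The checks listed above can be illustrated with a much-simplified stand-in for an Iterator. The sketch below is plain Java, not the SortedKeyValueIterator interface or the harness's test cases: `ListIteratorSketch` and `drain` are hypothetical, keys are plain strings, and a range is modeled as an inclusive start and an exclusive end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a simplified iterator over sorted string keys.
class ListIteratorSketch {
  private final List<String> data; // sorted keys, never modified
  private int position;
  private String endExclusive;

  ListIteratorSketch(List<String> sortedKeys) {
    this.data = sortedKeys;
    this.position = sortedKeys.size(); // no top until seek() is called
  }

  // Positions at the first key >= start; keys >= end are out of range.
  void seek(String start, String end) {
    endExclusive = end;
    position = 0;
    while (position < data.size() && data.get(position).compareTo(start) < 0) {
      position++;
    }
  }

  // Reads state only, so repeated calls cannot change the results.
  boolean hasTop() {
    return position < data.size() && data.get(position).compareTo(endExclusive) < 0;
  }

  String getTop() {
    return data.get(position);
  }

  void next() {
    position++;
  }

  // Independent copy that shares only the unmodified source data.
  ListIteratorSketch deepCopy() {
    return new ListIteratorSketch(data);
  }

  // Seeks and collects every key in range, calling hasTop() redundantly on purpose.
  static List<String> drain(ListIteratorSketch it, String start, String end) {
    it.seek(start, end);
    List<String> out = new ArrayList<>();
    while (it.hasTop()) {
      it.hasTop(); // extra call must be harmless
      out.add(it.getTop());
      it.next();
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("a", "b", "c", "d");
    ListIteratorSketch original = new ListIteratorSketch(keys);
    List<String> full = drain(original, "a", "z");

    // deepCopy safety: the copy must behave like the original. Prints true.
    System.out.println(drain(original.deepCopy(), "a", "z").equals(full));

    // Re-seek: a fresh instance resumes just after the last key returned ("b").
    List<String> resumed = new ArrayList<>(full.subList(0, 2));
    resumed.addAll(drain(new ListIteratorSketch(keys), "b\0", "z"));
    System.out.println(resumed.equals(full)); // prints true
  }
}
```

An Iterator that kept hidden state across `hasTop()` calls, or that assumed it would always be seeked from the beginning of the data, would fail one of these comparisons.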
In an Ideal World
→ Good testing means faster deployments
→ Faster deployment means more value for customers
→ Automated tests combat technical debt as code grows
→ More automation reduces developer stress
Unit Tests + MiniAccumuloCluster + Iterator Testing Harness =
In an Ideal World
A Trio of Testing Approaches
→ Unit Tests (test lifecycle phase)
– Fast verification given input/output
– Validate impact of state
→ Iterator Testing Harness (test lifecycle phase)
– Catch common mistakes
– Basic lifetime/API validation
– Encourage best practices
→ MiniAccumuloCluster (integration-test lifecycle phase)
– Functional/Acceptance tests
– Does the ingest/query system function?
– Real execution of Iterator by TabletServer
In an Ideal World
→ Standalone environment
– The “laptop test”
– Sanity check
→ Staging environments
– Small cluster with a subset of data
– Correctness and performance
[Figure labels: Code; Unit Tests, Iterator Test Harness, MAC; Binary Artifacts; Deploy; Standalone, Staging, Production]
In an Ideal World
→ No more “voodoo” and “black magic”
→ Find common errors fast
→ Catch bad Iterator design early
→ Standardized testing methodology
→ Community contributes new tests
→ Increase in quality, reusability, and confidence
Thank You
Twitter: @josh_elser
Email: elserj@apache.org / jelser@hortonworks.com

Accumulo Summit 2016: Effective Testing of Apache Accumulo Iterators