APACHE ACCUMULO
From a design perspective
SCALABLE KEY-VALUE STORE
BASED ON GOOGLE'S
BIGTABLE
BIGTABLE FEATURES
• Distributes data across many commodity servers	

• Sorts data by key for fast lookup of values by key	

• Scans across multiple key-value pairs

• Highly consistent writes to a single row

• Support for MapReduce jobs
DATA MODEL
Key: Row ID, Column (Family, Qualifier), Timestamp
Value

Row ID  Col Fam    Col Qual    Timestamp  Value
Bob     Email      id0023      20120301   Hey joe, can you send ...
Bob     Email      id0024      20120302   Re: next Thursday ...
Bob     UserPrefs  Background  20130101   Grey
Fred    Email      id0001      20080302   Welcome to gmail ...
Sarah   Email      id0004      20130201   Hi again ...
Sara    Videos     ytid009     20100303   nsu736:)jdudjdk$:)378;'$$)
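The table above is kept sorted by key, which is what makes lookups and row scans fast. A minimal sketch of that idea, using a plain in-memory `TreeMap` rather than Accumulo itself; the `Key` record, the `ORDER` comparator and the `scanRow` helper are hypothetical names for illustration, and the newest-timestamp-first ordering is an assumption of this sketch:

```java
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

final class SortedKeySketch {
    // Hypothetical key type mirroring the slide: row ID, column family, column qualifier, timestamp.
    record Key(String row, String family, String qualifier, long timestamp) {}

    // Sort by row, then family, then qualifier; within one column, newest timestamp first (assumed).
    static final Comparator<Key> ORDER = Comparator
            .comparing(Key::row)
            .thenComparing(Key::family)
            .thenComparing(Key::qualifier)
            .thenComparing(Comparator.comparingLong(Key::timestamp).reversed());

    // A few rows from the slide, inserted out of order on purpose.
    static TreeMap<Key, String> sampleTable() {
        TreeMap<Key, String> table = new TreeMap<>(ORDER);
        table.put(new Key("Fred", "Email", "id0001", 20080302L), "Welcome to gmail ...");
        table.put(new Key("Bob", "UserPrefs", "Background", 20130101L), "Grey");
        table.put(new Key("Bob", "Email", "id0024", 20120302L), "Re: next Thursday ...");
        table.put(new Key("Bob", "Email", "id0023", 20120301L), "Hey joe, can you send ...");
        return table;
    }

    // All entries of one row form a contiguous range of the sorted map.
    static SortedMap<Key, String> scanRow(TreeMap<Key, String> table, String row) {
        Key start = new Key(row, "", "", Long.MAX_VALUE);      // smallest possible key in this row
        Key end = new Key(row + "\0", "", "", Long.MAX_VALUE); // smallest key of the next possible row
        return table.subMap(start, true, end, false);
    }

    public static void main(String[] args) {
        scanRow(sampleTable(), "Bob").forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Because the map is sorted, reading everything for "Bob" is one range read and never touches "Fred".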
[Diagram: tablet servers (commit layer) and HDFS DataNodes (replication layer)]
SINCE 2006
• Several BigTable implementations	

• Apache HBase

• Apache Cassandra	

• Apache Accumulo	

• others …
BIGTABLE IS BIGTABLE RIGHT?
HBASE
• Open source Apache project started by developers at
Powerset, which was later bought by Microsoft

• Now used at Facebook, StumbleUpon, other big web sites	

• Fast reads	

• Row-oriented API	

• Each column family has its own set of files
CASSANDRA
• Apache project started at Facebook	

• Combines elements of BigTable and Amazon's Dynamo
into one system	

• Used at Netflix, other web sites	

• Fast writes	

• Tunable consistency
[Diagram: tablet servers as a combined commit and replication layer]
CONSISTENCY
• Highly consistent: each write goes to one place

• Eventually consistent: writes go to more than one place

• Writing to more than one place gives network partition tolerance

• Partition tolerance allows geographically distributed servers

• *Google uses Spanner to synchronize multiple databases
[Diagrams: tablet servers in Data Center A and Data Center B]
OVERVIEW
• Both highly scalable	

• Used to build web applications that can serve millions of
users at once	

• Serves as a low-latency persistence layer for real time
service of requests	

• Available in single data center or cross data center options
USE CASE
• Most data comes from users	

• Schema defined by the application	

• Data builds up over time
[Diagram: Many Users, Web application, Db]
ACCUMULO
• Can support the web application use-case	

• But what are those other extra features for?
ACCUMULO ‘EXTRAS’
• Dynamic Column Families	

• ColumnVisibility	

• Key-value oriented API	

• Iterators	

• Batch Scanners
BIG ORGANIZATIONS
• Missions other than internet services	

• Various disparate operational systems that
generate data	

• Desire to look across and analyze that data	

• Desire to deliver results to their own population
USE CASE IS DISCOVERING
AND ANALYZING ALL DATA
ISSUES
• Scale	

• Unknown or multiple schemas

• Support for analysis without data movement	

• Varying levels of sensitivity in the same system	

• Support a high number of low-latency user requests
[Diagram: Many Users, Analyze, Db, Data sets]
SCALE?
CHECK	

(IT’S BIGTABLE)
NO CONTROL OVER SCHEMA, OR
MANY DIFFERENT SCHEMAS?
MAP EXISTING FIELDS TO
COLUMNS DYNAMICALLY
INCLUDING COLUMN
FAMILIES
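One way to picture this dynamic mapping is a small sketch that turns an arbitrary source record into key-value cells with no predeclared schema. It is plain Java, not the Accumulo API; the `Cell` record and `toCells` helper are hypothetical, and using the source data set name as the column family is an assumption made only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class DynamicColumnsSketch {
    // Hypothetical cell type: one key-value pair as it would be written to the table.
    record Cell(String row, String family, String qualifier, String value) {}

    // Each field of the incoming record becomes its own column; nothing is declared up front.
    // Assumption: the data set name is used as the column family, the field name as the qualifier.
    static List<Cell> toCells(String rowId, String dataSet, Map<String, String> fields) {
        List<Cell> cells = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            cells.add(new Cell(rowId, dataSet, field.getKey(), field.getValue()));
        }
        return cells;
    }

    public static void main(String[] args) {
        // "hr" is a made-up data set name; the fields could be anything the source system produces.
        toCells("RID00003", "hr", Map.of("name", "harry", "age", "43"))
                .forEach(System.out::println);
    }
}
```

A record with different fields, from a different data set, goes through the same code path and simply produces different columns.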
VARYING LEVELS OF DATA
SENSITIVITY?
COLUMNVISIBILITY
DATA MODEL
Key: Row ID, Column (Family, Qualifier, Visibility), Timestamp
Value

Row ID  Col Fam    Col Qual    Col Vis         Timestamp  Value
Bob     Email      id0023      personal comms  20120301   Hey joe, can you send ...
Bob     Email      id0024      personal comms  20120302   Re: next Thursday ...
Bob     UserPrefs  Background  prefs           20130101   Grey
Fred    Email      id0001      personal comms  20080302   Welcome to gmail ...
Sarah   Email      id0004      personal comms  20130201   Hi again ...
Sara    Videos     ytid009     public post     20100303   nsu736:)jdudjdk$:)378;'$$)
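The effect of the visibility column can be sketched as a filter applied at scan time: a reader only sees cells whose label matches something they are authorized for. This is a simplified stand-in, not Accumulo's implementation; it treats each label from the slide as one opaque string, whereas Accumulo's ColumnVisibility evaluates boolean expressions over authorization tokens. The `Cell` record and `scan` helper are hypothetical:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class VisibilitySketch {
    // Hypothetical cell type carrying the visibility label alongside the data.
    record Cell(String row, String family, String qualifier, String visibility, String value) {}

    // Three rows from the slide, each with its own sensitivity label.
    static List<Cell> sampleTable() {
        return List.of(
                new Cell("Bob", "Email", "id0023", "personal comms", "Hey joe, can you send ..."),
                new Cell("Bob", "UserPrefs", "Background", "prefs", "Grey"),
                new Cell("Sara", "Videos", "ytid009", "public post", "nsu736:)jdudjdk$:)378;'$$)"));
    }

    // Simplification: a cell is returned only if its label is among the reader's authorizations.
    static List<Cell> scan(List<Cell> table, Set<String> authorizations) {
        return table.stream()
                .filter(cell -> authorizations.contains(cell.visibility()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // This reader may see preferences and public posts, but not personal communications.
        scan(sampleTable(), Set.of("prefs", "public post")).forEach(System.out::println);
    }
}
```

All three cells live in the same table; only the scan result differs per reader.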
DATA OF VARYING
SENSITIVITY LEVELS CAN BE
PHYSICALLY CO-LOCATED
FRAMEWORKS LIKE HADOOP
MAPREDUCE LOVE IT WHEN
DATA IS ALL TOGETHER
LOOK ACROSS DATASETS?
SECONDARY INDICES
• Application-created data: known	

• Pre-existing data? unknown
DATA DISCOVERY!
SECONDARY INDICES
RowID     Col Qual  Value
RID00001  age       54
RID00001  name      bob
RID00002  name      fred
RID00003  age       43
RID00003  height    5’9”
RID00003  name      harry
RID00004  name      carl
RID00005  name      evan

RowID  Col Fam  Col Qual
43     age      RID00003
54     age      RID00001
5’9”   height   RID00003
bob    name     RID00001
carl   name     RID00004
evan   name     RID00005
fred   name     RID00002
harry  name     RID00003
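The second table is the first one turned around: the value becomes the row ID, the field name becomes the column family, and the original row ID becomes the column qualifier. A minimal in-memory sketch of building and querying such an index; `Cell`, `IndexEntry`, `buildIndex` and `lookup` are hypothetical names, and a `TreeSet` stands in for the sorted index table:

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class SecondaryIndexSketch {
    // One cell of the original table, and one entry of the index table.
    record Cell(String rowId, String qualifier, String value) {}
    record IndexEntry(String term, String field, String rowId) {}

    // Index entries sort by term first, so all hits for a term are adjacent.
    static final Comparator<IndexEntry> ORDER = Comparator
            .comparing(IndexEntry::term)
            .thenComparing(IndexEntry::field)
            .thenComparing(IndexEntry::rowId);

    // Invert each cell: value -> term, qualifier -> field, row ID -> pointer back to the record.
    static TreeSet<IndexEntry> buildIndex(List<Cell> table) {
        TreeSet<IndexEntry> index = new TreeSet<>(ORDER);
        for (Cell cell : table) {
            index.add(new IndexEntry(cell.value(), cell.qualifier(), cell.rowId()));
        }
        return index;
    }

    // Find the row IDs of all records containing the term, in any field.
    static List<String> lookup(TreeSet<IndexEntry> index, String term) {
        IndexEntry from = new IndexEntry(term, "", "");
        IndexEntry to = new IndexEntry(term + "\0", "", "");
        return index.subSet(from, true, to, false).stream()
                .map(IndexEntry::rowId)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Cell> table = List.of(
                new Cell("RID00001", "age", "54"),
                new Cell("RID00001", "name", "bob"),
                new Cell("RID00003", "age", "43"),
                new Cell("RID00003", "name", "harry"));
        System.out.println(lookup(buildIndex(table), "harry"));
    }
}
```

The lookup needs no knowledge of which fields exist, which is what makes this layout useful for data whose schema is not known in advance.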
PARTIAL ROW SCANS
BATCH SCANNERS
RowID     Col Qual  Value
RID00001  age       54
RID00001  name      bob
RID00002  name      fred
RID00003  age       43
RID00003  height    5’9”
RID00003  name      harry
RID00004  name      carl
RID00005  name      evan

[Diagram label: Batch Scanner]
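After an index lookup returns a handful of row IDs, the matching records sit in scattered places in the original table. A batch scan fetches all of them in one request. The following is only a sequential in-memory sketch of that access pattern (Accumulo's BatchScanner fetches ranges in parallel and does not guarantee result order); the `fetchRows` helper and the `rowId + "\0" + qualifier` key encoding are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;

final class BatchLookupSketch {
    // Table keyed by rowId + "\0" + qualifier, so all columns of a row are adjacent.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("RID00001\0age", "54");
        table.put("RID00001\0name", "bob");
        table.put("RID00002\0name", "fred");
        table.put("RID00003\0age", "43");
        table.put("RID00003\0name", "harry");
        table.put("RID00004\0name", "carl");
        return table;
    }

    // Fetch the values of several unrelated rows: one small range read per requested row ID.
    static List<String> fetchRows(TreeMap<String, String> table, Collection<String> rowIds) {
        List<String> values = new ArrayList<>();
        for (String rowId : rowIds) {
            // The range [rowId\0, rowId\1) covers every column of exactly this row.
            values.addAll(table.subMap(rowId + "\0", true, rowId + "\1", false).values());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(fetchRows(sampleTable(), List.of("RID00001", "RID00003")));
    }
}
```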
COLUMNVISIBILITY APPLIES
TO INDEXES TOO
ANALYSIS?
MAPREDUCE: CHECK
SHUFFLE-SORTED?
• Between Map and Reduce phases is shuffle-sort	

• Sorting by key is necessary so all the values for a
given key end up next to each other …	

• BigTable also sorts keys …
ITERATORS
Value combine(Iterator<Value> values)
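As an illustration of what a function with this signature might do, here is a sketch of a summing combiner: because keys are sorted, all values for one key arrive together and can be folded into a single value. The `Value` record here is a hypothetical stand-in holding a number (Accumulo's own Value type wraps bytes), and the class is plain Java rather than an Accumulo iterator:

```java
import java.util.Iterator;
import java.util.List;

final class SummingCombinerSketch {
    // Hypothetical stand-in for the slide's Value type: just a count.
    record Value(long count) {}

    // Fold every value seen for one key into a single value by adding the counts.
    static Value combine(Iterator<Value> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().count();
        }
        return new Value(sum);
    }

    public static void main(String[] args) {
        // Three partial counts written for the same key collapse into one total.
        List<Value> partials = List.of(new Value(1), new Value(4), new Value(2));
        System.out.println(combine(partials.iterator()));
    }
}
```

Running such a function inside the database as data is written or scanned gives a reduce-like aggregation without moving the data out to a separate job.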
PRE-COMPUTATION
[Diagram: Many Users, Analyze, Db, Data sets]
ACCUMULO
