© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0
How to use this presentation
• Covered topics: Accumulo architecture, operational maintenance,
fault handling
• Intended audience: developers, support staff, and PMs who are
conversant in multi-component systems, e.g. those involved in web
services.
• Presumes familiarity with RDBMS
• Expected running time: 40 - 60 minutes
• License: CC-BY-SA 2.0
• Please let me know if you find it useful and what it could use:
busbey@cloudera.com
Introduction to
Apache Accumulo
Scaling a web application made easier
Sean Busbey // Software Engineer
Let’s talk about Apache Accumulo…
But in the context of a specific use case
• I really like technology that solves a problem.
• Keep in mind that this won’t be exhaustive.
• YMMV; proofs of concept with metrics are better than slides.
Who am I?
• Apache Accumulo PMC
• Apache HBase committer
• Software Engineer on Cloudera’s storage team
That is to say, I work for a vendor and no longer have operational scale problems of my own.
We’ll focus on an application that enables conversations centered on cute cats.
Simple sharing model built with privacy
controls
• User defines a group that may see their postings
• User posts a picture to a given group
• Members of the group may write short messages
Straightforward web architecture
Relational Data Model
• User table: maps user names to identifiers used elsewhere.
• Group table: tracks ownership and descriptive name.
• Group membership table: allows users to add and remove members.
• Topic table: tracks distribution group, owner, and topical image.
• Comment table: individual comments from users.
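As a rough illustration, the five tables might be sketched with Python's built-in sqlite3 module as below. Every table and column name here is an assumption made for the example (the slides only describe what each table is for), and the final query is one plausible shape for the read that renders a discussion page.

```python
import sqlite3

# Table and column names are hypothetical; the slides only describe each table's role.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_account (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE share_group  (id INTEGER PRIMARY KEY, owner_id INTEGER, name TEXT);
CREATE TABLE group_member (group_id INTEGER, user_id INTEGER);
CREATE TABLE topic        (id INTEGER PRIMARY KEY, group_id INTEGER, owner_id INTEGER, image_url TEXT);
CREATE TABLE comment      (id INTEGER PRIMARY KEY, topic_id INTEGER, author_id INTEGER, body TEXT);

INSERT INTO user_account VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO share_group  VALUES (10, 1, 'cat fans');
INSERT INTO group_member VALUES (10, 1), (10, 2);
INSERT INTO topic        VALUES (100, 10, 1, 'cat.jpg');
INSERT INTO comment      VALUES (1000, 100, 2, 'so fluffy');
""")

def page_for(reader_id):
    """Everything a reader may see: a four-table join on every page load."""
    return conn.execute("""
        SELECT t.id, t.image_url, u.name, c.body
        FROM group_member m
        JOIN topic t        ON t.group_id = m.group_id
        JOIN comment c      ON c.topic_id = t.id
        JOIN user_account u ON u.id = c.author_id
        WHERE m.user_id = ?
        ORDER BY t.id, c.id
    """, (reader_id,)).fetchall()
```

Each page view pays for the join, which is why reads are the first thing to hurt as traffic grows.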
First growth: robustness
Second growth: application scale out
Scaling reads: what goes into this page?
Database reads eventually become a bottleneck
Scale by de-normalizing in favor of reads
Change to writes: original
Change to writes: de-normalized
Generally known as the fan-out pattern.
The trick is to not get crushed by the writes
• Each poster now does a write for each member of the group a post goes to.
• Removing access is now a much larger delete query.
• Most databases are geared toward few writes and many reads; are we screwed?
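A toy version of fan-out on write makes the cost visible. The dictionaries below are hypothetical in-memory stand-ins for the group membership table and the de-normalized per-reader table; nothing here is database or Accumulo API.

```python
from collections import defaultdict

# Hypothetical stand-ins for the membership table and the per-reader table.
members = {"cat-fans": ["alice", "bob", "carol"]}
feed = defaultdict(list)  # reader id -> list of (group, author, text)

def post(group, author, text):
    """Fan-out on write: store one copy of the post per group member."""
    for reader in members[group]:
        feed[reader].append((group, author, text))
    return len(members[group])  # number of writes this single post caused

def revoke(group, reader):
    """Removing access means deleting every fanned-out copy for that reader."""
    members[group].remove(reader)
    feed[reader] = [row for row in feed[reader] if row[0] != group]

writes = post("cat-fans", "alice", "look at this cat")
```

One post to a three-member group costs three writes, and `revoke` has to touch every copy that reader ever received.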
Recall our access pattern
Basically one of these consumer boxes.
Lines up very well with sharding
• Divide the query space up by e.g. a hash of user id into n shards.
• Store a copy of the table on each shard, but just for user ids that hash to that shard.
• Reads and writes are spread across instances.
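The routing rule can be sketched in a few lines. The shard count and the choice of SHA-256 are assumptions for the example; any stable hash would do. The helper at the end counts how many users land on a different shard when n changes, which hints at why growing a sharded deployment is painful.

```python
import hashlib

NUM_SHARDS = 4  # assumed value of n

def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def moved_when_resharding(user_ids, old_n, new_n):
    """Count users whose data must be rewritten when the shard count changes."""
    return sum(1 for u in user_ids if shard_for(u, old_n) != shard_for(u, new_n))

users = [f"user-{i}" for i in range(1000)]
moved = moved_when_resharding(users, 4, 5)
```

With simple modulo placement most users move when a shard is added, so adding capacity means rewriting most of the data.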
Database shards layout
What were the nice-to-haves for the RDBMS again?
• No longer leveraging the relational data model.
• Now running, backing up, and failing over as many database instances as there are shards.
• Robustness within a shard has to be managed.
• Sharding is essentially static; adding more resources with growth is still painful.
Now we have some context for Accumulo.
Our goal is to end up with less operational overhead.
“The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.”
Accumulo PMC via https://accumulo.apache.org/
Accumulo-based App Layout
In Accumulo, you address cells rather than records
Key → Value
Keys are multi-dimensional
Key = Row, Column (Family, Qualifier, Visibility), Time
Accumulo doesn’t assume a schema
• All key and value components, save time, are byte[]
• The application is responsible for serialization
• Common to use different serialization for the values in different columns.
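A small model of a cell shows what "no schema" means in practice. This is a plain-Python sketch, not the Accumulo client API; the `Key` class, the column names, and the two encoders are all made up for illustration.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Key:
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int  # the one component that is not a byte array

# Serialization is the application's job, and it may differ per column.
def encode_comment(author: str, text: str) -> bytes:
    return json.dumps({"author": author, "text": text}).encode("utf-8")

def encode_count(n: int) -> bytes:
    return n.to_bytes(8, "big")

comment_cell = (Key(b"reader-42", b"discussion-7", b"comment", b"group-3", 1420070400000),
                encode_comment("alice", "so fluffy"))
count_cell = (Key(b"reader-42", b"discussion-7", b"unread", b"group-3", 1420070400000),
              encode_count(5))
```

The store sees only bytes; knowing that one column holds JSON and another an 8-byte integer lives entirely in the application.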
Mapping records to cells
• Treat a row as a database record
  • Essentially each column is a record field
• Treat each cell as a database record
  • Need to uniquely identify each record
  • Useful if you generally need the whole row and not a subset of columns
  • Can then treat each row as a shard of database records.
Let’s use a concrete example.
We already know our reads are within a shard.
Mapping our data into cells
Key component      Content
Row                reader id
Column Family      discussion id
Column Qualifier   comment order
Visibility         group id
Value              author, image url, and comment
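The mapping can be written down as a function that turns one comment into one cell per reader. The function name, the tab-separated value encoding, and the zero-padded comment order are assumptions for this sketch; only the choice of which field goes in which key component comes from the slide.

```python
from typing import NamedTuple

class Key(NamedTuple):
    row: bytes         # reader id
    family: bytes      # discussion id
    qualifier: bytes   # comment order
    visibility: bytes  # group id

def comment_cells(readers, discussion_id, order, group_id, author, image_url, comment):
    """Build one (key, value) cell per reader for a single comment."""
    value = "\t".join([author, image_url, comment]).encode("utf-8")
    return [
        (Key(r.encode(), discussion_id.encode(), f"{order:08d}".encode(), group_id.encode()), value)
        for r in readers
    ]

cells = comment_cells(["alice", "bob"], "d7", 3, "g1",
                      "carol", "http://example.com/cat.jpg", "so fluffy")
```

Zero-padding the order keeps comment 10 after comment 9 once everything is compared as bytes.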
We end up with something close to our original.
Note the use of visibility
Visibility enforcement
• At scan time, our application will pass in the groups for the current user.
• Accumulo will filter any cells that don’t match those groups.
• Group removal is a simple update in the group management system again.
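A simplified model of that filtering is below. It assumes each cell carries exactly one group label; real Accumulo visibility labels are boolean expressions, so treat this as a sketch of the idea rather than of the server's behavior. All names and data are made up.

```python
# Each cell is ((row, family, qualifier, visibility), value); all names are made up.
cells = [
    (("bob", "d1", "0001", "cat-fans"), "alice: look at this cat"),
    (("bob", "d2", "0001", "dog-fans"), "carol: look at this dog"),
]

def scan(cells, authorizations):
    """Return only the cells whose visibility label is among the caller's authorizations."""
    return [(key, value) for key, value in cells if key[3] in authorizations]

before = scan(cells, {"cat-fans", "dog-fans"})
# Bob leaves dog-fans: only the set passed at scan time changes, no cell is rewritten.
after = scan(cells, {"cat-fans"})
```

Compare this with the fan-out delete earlier: revoking access no longer touches stored data at all.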
Sparse column storage
• We are creating lots of columns: per discussion per group member.
• Accumulo only stores columns that exist in a given row.
All cells sorted according to key
• Total ordering based on lex-sort of raw byte arrays
of key components.
• Time is sorted most-recent-first
• Reads are done on a contiguous range of cells.
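The ordering and the range read can be modeled with a sorted list and binary search. This is a sketch in plain Python with made-up keys, not Accumulo's implementation; the tuple layout (row, family, qualifier, visibility, timestamp) is an assumption for the example.

```python
import bisect

def sort_key(key):
    """Compare byte components lexicographically, then newest timestamp first."""
    row, family, qualifier, visibility, timestamp = key
    return (row, family, qualifier, visibility, -timestamp)

keys = [
    (b"bob", b"d2", b"0001", b"g1", 100),
    (b"alice", b"d1", b"0001", b"g1", 100),
    (b"bob", b"d1", b"0002", b"g1", 100),
    (b"bob", b"d1", b"0001", b"g1", 100),
    (b"bob", b"d1", b"0001", b"g1", 200),
]
ordered = sorted(keys, key=sort_key)
index = [sort_key(k) for k in ordered]

def scan_row(row: bytes):
    """Read the contiguous run of cells for one row using two binary searches."""
    lo = bisect.bisect_left(index, (row,))
    hi = bisect.bisect_left(index, (row + b"\x00",))
    return ordered[lo:hi]

bob_cells = scan_row(b"bob")
```

Because a reader's cells are adjacent in sort order, a page view is one seek plus a sequential read instead of a join.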
When sorted, our data looks like this…
And the scan for a page is roughly…
Lexicoders
• Turning different kinds of data into sortable bytes is painful
• Accumulo ships implementations for several common Java
types
• Also for e.g. reversing the sort order and building compound
keys.
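To see why this is painful by hand, here is a small sketch of the three ideas: a sortable integer encoding, a reversed one, and a compound key. These helpers are hypothetical and are not Accumulo's Lexicoder classes; the compound encoding additionally assumes the text part contains no NUL byte.

```python
import struct

def encode_uint(n: int) -> bytes:
    """Fixed-width big-endian bytes sort in the same order as the integers."""
    return struct.pack(">Q", n)

def encode_reverse(n: int) -> bytes:
    """Complementing every byte flips the sort order."""
    return bytes(0xFF - b for b in encode_uint(n))

def encode_pair(text: str, n: int) -> bytes:
    """Compound key: text, a NUL separator, then the sortable integer."""
    return text.encode("utf-8") + b"\x00" + encode_uint(n)

values = [100, 9, 10]
naive = sorted(values, key=lambda n: str(n).encode("utf-8"))
ascending = sorted(values, key=encode_uint)
descending = sorted(values, key=encode_reverse)
```

The naive decimal-string encoding puts 10 before 9, which is exactly the kind of bug the shipped encoders exist to prevent.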
Inefficiencies in our data model
Key component      Content
Row                reader id
Column Family      discussion id
Column Qualifier   comment order
Visibility         group id
Value              author, image url, and comment
Two categories of data
Image cells:
Row                reader id
Column Family      discussion id
Column Qualifier   image
Visibility         group id
Value              author, image url
Comment cells:
Row                reader id
Column Family      discussion id
Column Qualifier   text
Visibility         group id
Value              author, comment
And now our data looks like this
And the scan for a page covers less data
Our simplified diagram
Slightly less simplified
Back to the data model
Key = Row, Column (Family, Qualifier, Visibility), Time
Rows are grouped into Tablets
• A Tablet is defined by a start and end row.
• All cells for a given row must be in the same Tablet.
Tablets are assigned to Tablet Servers
• At any given point in time, a Tablet is serviced by a single Tablet Server.
• That server is responsible for client reads and writes to all Tablets it hosts.
• Finding the proper server is handled by the Accumulo libraries.
• Proper key design means I/O load gets spread across multiple machines.
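The lookup a client library performs can be approximated with split points and a binary search. The split points, server names, and the convention that a tablet's end row is inclusive are assumptions for this sketch; the real client consults Accumulo's own metadata rather than a local dictionary.

```python
import bisect

# Hypothetical split points: tablet i covers rows in (splits[i-1], splits[i]].
splits = [b"g", b"n", b"t"]  # three splits give four tablets
assignment = {0: "tserver-a", 1: "tserver-b", 2: "tserver-a", 3: "tserver-c"}

def tablet_for(row: bytes) -> int:
    """Find the tablet whose row range contains this row."""
    return bisect.bisect_left(splits, row)

def server_for(row: bytes) -> str:
    """Exactly one server hosts a given tablet at any point in time."""
    return assignment[tablet_for(row)]
```

If every row key started with the same prefix, every request would land on one tablet and one server, which is what "proper key design" is meant to avoid.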
Tablet assignment is not static
• Assignments tend to reach a steady state
• But Tablets can move when new resources are added or a server fails
Remember our RDBMS scaling?
New RDBMS shard
1. Provision hardware for service
2. Rewrite data under new sharding
3. Update application services
• Doing this without an outage is hard work (and well paid if you can
get it)
New Accumulo Tablet Server
1. Provision hardware for service
2. Add server to cluster
3. Tablets automatically migrate from busier nodes to new node
• No outage from client perspective.
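Step 3 can be pictured with a toy balancer. The slides do not describe the actual balancing policy, so this sketch simply assumes "busier" means "hosting more tablets" and moves one tablet at a time until the new server is within one tablet of the busiest.

```python
def rebalance(assignment, new_server):
    """Move tablets one at a time from the most loaded server to the new one."""
    hosted = {}
    for tablet, server in assignment.items():
        hosted.setdefault(server, []).append(tablet)
    hosted.setdefault(new_server, [])
    while True:
        busiest = max(hosted, key=lambda s: len(hosted[s]))
        if len(hosted[busiest]) - len(hosted[new_server]) <= 1:
            break
        hosted[new_server].append(hosted[busiest].pop())
    return {t: s for s, tablets in hosted.items() for t in tablets}

# Eight tablets on two servers, then a third server joins.
before = {t: ("tserver-a" if t < 4 else "tserver-b") for t in range(8)}
after = rebalance(before, "tserver-c")
```

Only the assignment changes here; in contrast to resharding, no row has to be rehashed or rewritten under a new scheme.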
All distributed systems have communication
failures
In the face of such a failure you can either
• remain available on remaining nodes to all clients
• provide a consistent view of updates to a subset of
clients
Now you know the basics of CAP
Remember that you can’t give up partition tolerance
Remember our RDBMS robustness?
Accumulo is a CP system
• Tablet Servers ensure that updates have been written to a distributed write-ahead log before acknowledging them.
• Tablet Server failures are automatically detected.
• Newly assigned hosts for recovered Tablets then replay edits up to the last ack before serving new requests.
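The log-then-acknowledge rule and the replay on recovery fit in a few lines. This is a single-process sketch: a Python list stands in for the distributed write-ahead log, and the class and method names are invented for the example.

```python
class TabletServer:
    def __init__(self, wal):
        self.wal = wal      # stand-in for the durable, shared write-ahead log
        self.memory = {}    # in-memory state, lost if this server dies

    def write(self, key, value):
        self.wal.append((key, value))  # log the edit first
        self.memory[key] = value       # then apply it in memory
        return "ack"                   # only now acknowledge the client

    @classmethod
    def recover(cls, wal):
        """A newly assigned host replays the log before serving requests."""
        server = cls(wal)
        for key, value in wal:
            server.memory[key] = value
        return server

wal = []
first = TabletServer(wal)
first.write("reader-42", "v1")
first.write("reader-42", "v2")
del first                          # simulate the server failing
second = TabletServer.recover(wal)
```

Every acknowledged write is in the log, so the replacement server ends up with the same state the failed one had acknowledged.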
[Diagram: client write path]
Write goals
• Low latency ack
• Don’t lose acked writes in face of node failure
[Diagram: client write path, steps 1 to 3]
Recovery timing
• Tunable time to detection – increases network load
• Size of outstanding write ahead logs
[Diagram: client write path, steps 1 to 4]
Accumulo-based App Layout
What’s the catch?
Gaps
• Still requires application updates to use API – no interactive SQL
bindings*
• No Disaster Recovery – coming in next minor release
Thank you.
Mr. Mean photo from mockup is © 2004 Flickr user aznewbeginning; cc-by-sa 2.0 https://flic.kr/p/4uzdRc

More Related Content

PPTX
An Introduction to Accumulo
PDF
Comparing Accumulo, Cassandra, and HBase
PDF
Matching the Scale at Tinder with Kafka
PPTX
Scaling Push Messaging for Millions of Netflix Devices
PPTX
IBM Spectrum Scale Authentication For Object - Deep Dive
PPT
Less08 users
PDF
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
PDF
Introduction to Big Data
An Introduction to Accumulo
Comparing Accumulo, Cassandra, and HBase
Matching the Scale at Tinder with Kafka
Scaling Push Messaging for Millions of Netflix Devices
IBM Spectrum Scale Authentication For Object - Deep Dive
Less08 users
Oracle Gen 2 Exadata Cloud@Customer:サービス概要のご紹介 [2021年7月版]
Introduction to Big Data

What's hot (20)

PDF
Cloud9で試すAWS IoT Greengrass V2
PPTX
Deep Dive into Apache Kafka
PDF
Caffe - A deep learning framework (Ramin Fahimi)
PDF
Power up Your BI with Geospatial Data
PPTX
Overview on Azure Machine Learning
PDF
20190226 AWS Black Belt Online Seminar Amazon WorkSpaces
PDF
Db2をAWS上に構築する際のヒント&TIPS 2020年6月版
PPTX
Hive, Presto, and Spark on TPC-DS benchmark
PDF
WekaIO: Making Machine Learning Compute Bound Again
PDF
AWS Black Belt Techシリーズ Amazon EMR
PPTX
Customer Presentation - Aruba Wi-Fi Overview (1).PPTX
PDF
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
PPTX
Introduction to MongoDB and CRUD operations
PDF
Kafka for Real-Time Replication between Edge and Hybrid Cloud
PPTX
Oracleからamazon auroraへの移行にむけて
PPTX
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
PDF
Using Azure Compute with VMSS, Kubernetes, and Service Fabric
PDF
【より深く知ろう】活用最先端!データベースとアプリケーション開発をシンプルに、高速化するテクニック
PDF
AWS Single Sign-On (SSO) 서비스 집중 탐구 - 윤석찬 :: AWS Unboxing 온라인 세미나
Cloud9で試すAWS IoT Greengrass V2
Deep Dive into Apache Kafka
Caffe - A deep learning framework (Ramin Fahimi)
Power up Your BI with Geospatial Data
Overview on Azure Machine Learning
20190226 AWS Black Belt Online Seminar Amazon WorkSpaces
Db2をAWS上に構築する際のヒント&TIPS 2020年6月版
Hive, Presto, and Spark on TPC-DS benchmark
WekaIO: Making Machine Learning Compute Bound Again
AWS Black Belt Techシリーズ Amazon EMR
Customer Presentation - Aruba Wi-Fi Overview (1).PPTX
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Introduction to MongoDB and CRUD operations
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Oracleからamazon auroraへの移行にむけて
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Using Azure Compute with VMSS, Kubernetes, and Service Fabric
【より深く知ろう】活用最先端!データベースとアプリケーション開発をシンプルに、高速化するテクニック
AWS Single Sign-On (SSO) 서비스 집중 탐구 - 윤석찬 :: AWS Unboxing 온라인 세미나
Ad

Similar to Introduction to Apache Accumulo (20)

PPTX
Decoupling Decisions with Apache Kafka
PPTX
Kafka for DBAs
PDF
VMworld Europe 2014: What’s New in End User Computing: Full Desktop Automatio...
PDF
CloudStack - Top 5 Technical Issues and Troubleshooting
PDF
intro-kafka
PPTX
Docker Swarm vs. Kubernetes Which is the best
PPTX
Paul Angus - CloudStack Container Service
PDF
Představení Oracle SPARC Miniclusteru
PPTX
Security posture improvements for Google Cloud Platform
PDF
CloudStack Tooling Ecosystem – Kiran Chavala, ShapeBlue
PDF
Best Practices For Workflow
PDF
Enterprise Management for the AWS Cloud
PPTX
Cloud as a Service: A Powerful New Cloud Management Platform
PPTX
The Reality of DIY Kubernetes vs. PKS
PDF
Migrating Your Windows Datacenter to AWS
PPTX
Kafka/SMM Crash Course
PDF
oci-container-engine-oke-100.pdf
PDF
CloudStack Container Service
PPTX
Connecting All Abstractions with Istio
PPTX
Azure News Slides for October2017 - Azure Nights User Group
Decoupling Decisions with Apache Kafka
Kafka for DBAs
VMworld Europe 2014: What’s New in End User Computing: Full Desktop Automatio...
CloudStack - Top 5 Technical Issues and Troubleshooting
intro-kafka
Docker Swarm vs. Kubernetes Which is the best
Paul Angus - CloudStack Container Service
Představení Oracle SPARC Miniclusteru
Security posture improvements for Google Cloud Platform
CloudStack Tooling Ecosystem – Kiran Chavala, ShapeBlue
Best Practices For Workflow
Enterprise Management for the AWS Cloud
Cloud as a Service: A Powerful New Cloud Management Platform
The Reality of DIY Kubernetes vs. PKS
Migrating Your Windows Datacenter to AWS
Kafka/SMM Crash Course
oci-container-engine-oke-100.pdf
CloudStack Container Service
Connecting All Abstractions with Istio
Azure News Slides for October2017 - Azure Nights User Group
Ad

Recently uploaded (20)

PDF
Zenith AI: Advanced Artificial Intelligence
PDF
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
A novel scalable deep ensemble learning framework for big data classification...
PDF
August Patch Tuesday
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PDF
A comparative study of natural language inference in Swahili using monolingua...
PDF
DP Operators-handbook-extract for the Mautical Institute
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
PDF
A comparative analysis of optical character recognition models for extracting...
PPTX
Group 1 Presentation -Planning and Decision Making .pptx
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PDF
Hindi spoken digit analysis for native and non-native speakers
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PPTX
cloud_computing_Infrastucture_as_cloud_p
PDF
Web App vs Mobile App What Should You Build First.pdf
PDF
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf
Zenith AI: Advanced Artificial Intelligence
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
Accuracy of neural networks in brain wave diagnosis of schizophrenia
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
A novel scalable deep ensemble learning framework for big data classification...
August Patch Tuesday
SOPHOS-XG Firewall Administrator PPT.pptx
A comparative study of natural language inference in Swahili using monolingua...
DP Operators-handbook-extract for the Mautical Institute
1 - Historical Antecedents, Social Consideration.pdf
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
A comparative analysis of optical character recognition models for extracting...
Group 1 Presentation -Planning and Decision Making .pptx
Univ-Connecticut-ChatGPT-Presentaion.pdf
Hindi spoken digit analysis for native and non-native speakers
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
cloud_computing_Infrastucture_as_cloud_p
Web App vs Mobile App What Should You Build First.pdf
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf

Introduction to Apache Accumulo

  • 1. 1© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 How to use this presentation • Covered topics: Accumulo architecture, operational maintenance, fault handling • Intended Audience: Developers, supporters, PMs who are conversant in multi-component systems, i.e. involved in web services. • Presumes familiarity with RDBMS • Expected running time: 40 - 60 minutes • License: CC-BY-SA 2.0 • Please let me know if you find it useful and what it could use: busbey@cloudera.com
  • 2. Introduction to Apache Accumulo Scaling a web application made easier Sean Busbey // Software Engineer
  • 3. 3© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Let’s talk about Apache Accumulo…
  • 4. 4© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 But in the context of a specific use case •I really like technology that solves a problem. •Keep in mind that this won’t be exhaustive. •YMMV, proof-of-concepts with metrics are better than slides.
  • 5. 5© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Who am I? • Apache Accumulo PMC • Apache HBase committer • Software Engineer on Cloudera’s storage team
  • 6. 6© 2015 Cloudera licensed CC-BY-SA 2.0 That is to say, I work for a vendor and no longer have operational scale problems of my own.
  • 7. We’ll focus on an application that enables conversations centered on cute cats.
  • 8. 8© 2015 Cloudera licensed CC-BY-SA 2.0
  • 9. 9© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Simple sharing model built with privacy controls •User defines a group that may see their posting •User posts a picture to a given group •Members of the group may write short messages
  • 10. 10© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Straight forward web architecture
  • 11. 11© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Relational Data Model Will map user names to identifiers used elsewhere. Will track ownership and descriptive name. Will allow users to add and remove members. User table Group table Group membership table
  • 12. 12© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Relational Data Model Tracks distribution group, owner, and topical image. Individual comments from users. Topic table Comment table
  • 13. 13© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 First growth: robustness
  • 14. 14© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 First growth: robustness
  • 15. 15© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Second growth: application scale out
  • 16. 16© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Scaling reads: what goes into this page?
  • 17. 17© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Database reads eventually become a bottleneck
  • 18. 18© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Scale by de-normalizing in favor of reads
  • 19. 19© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Change to writes - original
  • 20. 20© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Change to writes – de-normalized
  • 21. Generally known as the fan-out pattern. 21© 2015 Cloudera licensed CC-BY-SA 2.0
  • 22. 22© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 The trick is to not get crushed by the writes •Each poster now does a write for each member of the group a post goes to. •Removing access is now a much larger delete query. •Most databases are geared toward few writes and many reads; are we screwed?
  • 23. 23© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Recall our access pattern
  • 24. Basically one of these consumer boxes. 24© 2015 Cloudera licensed CC-BY-SA 2.0
  • 25. 25© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Lines up very well with sharding •Divide the query space up by e.g. a hash of user id into n shards. •Store a copy of table on each shard, but just for user ids that hash to that shard. •Reads and writes are spread across instances.
  • 26. 26© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Database shards Layout
  • 27. 27© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 What were the nice-to-haves for the RDBMS again? • No longer leveraging relational data model. • Now running, backing up, and failing over num shards number of database instances. • Robustness in a shard has to be managed. • Sharding is essentially static; adding more resources with growth still painful.
  • 28. 28© 2015 Cloudera licensed CC-BY-SA 2.0 Now we have some context for Accumulo. Our goal is to end up with less operational overhead.
  • 29. 29© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 “The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.” Accumulo PMC via https://accumulo.apache.org/
  • 30. 30© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Accumulo-based App Layout
  • 31. 31© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 “The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.” Accumulo PMC via https://accumulo.apache.org/
  • 32. 32© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 In Accumulo, you address cells rather than records Key Valu e
  • 33. 33© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Keys are multi-dimensional Key Valu e Ro w Column Tim e
  • 34. 34© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Keys are multi-dimensional Key Valu e Ro w Column Tim eFamily Qualifier Visibility
  • 35. 35© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Accumulo doesn’t assume a schema •All key and value components, save time, are byte[] •The application is responsible for serialization •Common to use different serialization for the values in different columns.
  • 36. 36© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Mapping records to cells •Treat a row as a database • Essentially each column is a record field •Treat each cell as a database record • Need to uniquely identify each record • Useful if you generally need the whole row and not a subset of columns • Can then treat each row as a shard of database records.
  • 37. 37© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Let’s use a concrete example.
  • 38. 38© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Already know our reads are within a shard.
  • 39. 39© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Mapping our data into cells Key Value Row Column Family Column Qualifier Visibility author, image url, and comment reader id discussion id comment order group id
  • 40. 40© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 We end up with something close to our original.
  • 41. 41© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Note the use of visibility
  • 42. 42© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Visibility enforcement •At scan time, our application will pass in the groups for the current user. •Accumulo will filter any cells that don’t match those groups. • Group removal is a simple update in the group management system again.
  • 43. 43© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Sparse column storage •We are creating lots of columns: per discussion per group member. •Accumulo only stores columns that exist in a given row.
  • 44. 44© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 “The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.” Accumulo PMC via https://accumulo.apache.org/
  • 45. 45© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 All cells sorted according to key • Total ordering based on lex-sort of raw byte arrays of key components. • Time is sorted most-recent-first • Reads are done on a contiguous range of cells.
  • 46. 46© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 When sorted our data looks like this….
  • 47. 47© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 And the scan for a page is roughly…
  • 48. 48© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Lexicoders • Turning different kinds of data into sortable bytes is painful • Accumulo ships implementations for several common Java types • Also for e.g. reversing the sort order and building compound keys.
  • 49. 49© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Inefficiencies in our data model Key Value Row Column Family Column Qualifier Visibility author, image url, and comment reader id discussion id comment order group id
  • 50. 50© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Two categories of data Key Value Row Column Family Column Qualifier Visibility author, image url reader id discussion id image group id Key Value Row Column Family Column Qualifier Visibility author, comment reader id discussion id text group id
  • 51. 51© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 And now our data looks like this
  • 52. 52© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 And the scan for a page covers less data
  • 53. 53© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 “The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.” Accumulo PMC via https://accumulo.apache.org/
  • 54. 54© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Our simplified diagram
  • 55. 55© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Slightly less simplified
  • 56. 56© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Back to the data model Key Valu e Ro w Column Tim eFamily Qualifier Visibility
  • 57. 57© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Back to the data model Key Valu e Ro w Column Tim eFamily Qualifier Visibility
  • 58. 58© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Rows are grouped into Tablets • Tablet is defined by a start and end row • All cells for a given row must be in the same Tablet.
  • 59. 59© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Tablets are assigned to Tablet Servers • At any given point in time, a Tablet is serviced by a single Tablet Server
  • 60. 60© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Slightly less simplified
  • 61. 61© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Tablets are assigned to Tablet Servers • At any given point in time, a Tablet is serviced by a single Tablet Server • That server is responsible for client reads and writes to all hosted Tablets • Finding the proper server is handled by the Accumulo libraries • Proper key design means io load gets spread across multiple machines
  • 62. 62© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 “The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.” Accumulo PMC via https://accumulo.apache.org/
  • 63. 63© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Tablet assignment is not static • Assignment tend to have steady state • But can move in the event of new resources or failure
  • 64. 64© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 Remember our RDBMS scaling?
  • 65. 65© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 New RDBMS shard 1. Provision hardware for service 2. Rewrite data under new sharding 3. Update application services • Doing this without an outage is hard work (and well paid if you can get it)
  • 66. 66© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 New Accumulo Tablet Server 1. Provision hardware for service 2. Add server to cluster 3. Tablets automatically migrate from busier nodes to new node • No outage from client perspective.
  • 67. 67© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 “The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.” Accumulo PMC via https://accumulo.apache.org/
  • 68. 68© 2015 Cloudera, Inc. licensed CC-BY-SA 2.0 All distributed systems have communication failures In the face of such a failure you can either • remain available on remaining nodes to all clients • provide a consistent view of updates to a subset of clients
  • 69. 69© 2015 Cloudera licensed CC-BY-SA 2.0 Now you know the basics of CAP Remember that you can’t give up partition tolerance
  • 70. Remember our RDBMS robustness?
  • 71. Accumulo is a CP system • Tablet Servers ensure that updates have been written to a distributed write-ahead-log before acknowledging • Tablet Server failures are automatically detected • Newly assigned hosts for recovered Tablets then replay edits up until last ack before serving new requests
  • 72.
  • 73. Client write
  • 74. Write goals • Low latency ack • Don’t lose acked writes in face of node failure
  • 75. Client write 1
  • 76. Client write 1 2
  • 77. Client write 1 2 3
  • 78.
  • 79.
  • 80.
  • 81.
  • 82.
  • 83. Recovery timing • Tunable time to detection – increases network load • Size of outstanding write ahead logs
  • 84. Client write 1 2 3 4
  • 85. Accumulo-based App Layout
  • 86. What’s the catch?
  • 87. Gaps • Still requires application updates to use API – no interactive SQL bindings* • No Disaster Recovery – coming in next minor release
  • 88. Thank you. Mr. Mean photo from mockup is © 2004 Flickr user aznewbeginning; cc-by-sa 2.0 https://flic.kr/p/4uzdRc
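
The numbered "Client write" steps on slides 75 to 84 can be sketched roughly as below. This is a toy, single-process model and not the Accumulo API: the class name `TabletServerSketch`, its methods, and the use of a plain Python list as the "distributed" write-ahead log are all assumptions made for illustration. It only shows the ordering the slides describe: log first, then buffer, then ack, with an occasional flush to sorted storage, and replay of the log by whichever server takes over after a failure.

```python
class TabletServerSketch:
    """Toy model of the write path: log first, then buffer, then ack."""

    def __init__(self, shared_log=None, sorted_files=None):
        # Stand-in for the distributed write-ahead log; it outlives this server.
        self.wal = shared_log if shared_log is not None else []
        # In-memory buffer of accepted writes; lost if the server dies.
        self.memtable = {}
        # Stand-in for sorted files on persistent, distributed storage.
        self.sorted_files = sorted_files if sorted_files is not None else []

    def write(self, key, value):
        self.wal.append((key, value))  # 1. durable copy for recovery
        self.memtable[key] = value     # 2. buffer the accepted write
        return "ack"                   # 3. only now acknowledge the client

    def flush(self):
        # 4. write the buffer out as a newly sorted file, then drop the
        #    log entries that the new file makes unnecessary.
        if self.memtable:
            self.sorted_files.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()

    def read(self, key):
        # Newest data wins: check the buffer, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sorted_file in reversed(self.sorted_files):
            for k, v in sorted_file:
                if k == key:
                    return v
        return None

    @classmethod
    def recover(cls, shared_log, sorted_files):
        # A newly assigned server replays logged edits before serving requests.
        server = cls(shared_log, sorted_files)
        for key, value in list(shared_log):
            server.memtable[key] = value
        return server


log = []                                  # hypothetical shared log
first = TabletServerSketch(log)
first.write("user_a", "mr-mean.jpg")
first.flush()                             # safely in a sorted file now
first.write("user_b", "so fluffy")        # acked, but only in log + buffer
# "first" fails here; its in-memory buffer is gone.
second = TabletServerSketch.recover(log, first.sorted_files)
```

In this sketch the recovery cost grows with the number of entries still in the log, which is the "size of outstanding write ahead logs" point from slide 83.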

Editor's Notes

  • #4: Accumulo is a distributed key value store based on the BigTable paper.
  • #5: Technology introduction talks are often driven from the perspective of what and how an application works rather than why. So this evening I’m going to start with a common use case and then talk about how Accumulo’s implementation addresses issues for that use. This means you’ll have to bear with me as we establish how we might end up with a set of problems that Accumulo helps solve.
  • #6: Involved in Accumulo for a while. Recently I’ve been working more on HBase, another distributed key value store based on BigTable.
  • #7: I do still have the scale problems of our supported customers and the project communities I work within. I also spent my years prior to Cloudera building scalable systems I can’t talk about. The upshot is that we’ll have to come up with a convincing contrivance.
  • #8: That cat’s name is actually Mr. Mean. The photo is from https://flic.kr/p/4uzdRc © 2004 Flickr user aznewbeginning; cc-by-sa 2.0
  • #9: Make sure you can see my sweet website mockup. The photo is from https://flic.kr/p/4uzdRc © 2004 Flickr user aznewbeginning; cc-by-sa 2.0
  • #10: Because Cute Cat Conversations Dot Com cares about giving users control over their pictures, we require that when a user removes someone from a group that person doesn’t still see old conversations. So groups are both a distribution and an authorization mechanism.
  • #11: These are logical components. You’re probably running them on a single node.
  • #13: All nice, tidy, and easy to reason about. We can straightforwardly look at how we can add conversations, comments, and manage groups with a minimal number of updates.
  • #14: As our cat conversations get a little popular, we need to make sure one component failure doesn’t sink us.
  • #15: So you add another application server, and you set up replication for your database. Depending on the specifics of the database you use and how much time you invest, you might have to deal with brief outages while you handle failover yourself.
  • #16: For most applications, as you gain users and activity there’ll be substantial gains to just doing sessionized load balancing against more application servers. In most cases, this will also buy you more robustness at the application layer. More importantly, you still have a relational model for your data that is nice to reason about and a relatively easy to administer data store.
  • #18: You’ll need to do some filtering and ordering in addition, but you’ll need to hit these two joins because you must turn the current user into a set of groups. You might break this into two queries, one for a set of discussions and one for a set of comments, but you’re still going to have latency because of having to look across tables or round trip to the application server.
  • #19: Now we can get everything we need for a given user’s page with a single table. (We could also just set the image url to null for the rows that are comments)
  • #20: However, originally each new comment or conversation start only required updating a single row. Like when this comment was added.
  • #21: Now that same single comment involves writing 3x as many rows. If you look above, the same applies for posting a new conversation image.
  • #22: When read latency is important, this is an established optimization technique. It’s called fan-out because the writes from a given producer form a “fan” as they connect to a subset of the potential consumer lists. This pattern comes up whenever you’re going to have a large number of readers in a time-sensitive context, like the web, that receive updates from a smaller set of producers. Think social sharing sites like Pinterest, Twitter, or even Google Plus.
  • #23: Generally, I think if you get to the point of implementing fan-out you should look seriously at moving to one of the distributed key-value stores. But the traditional RDBMS isn’t done just yet.
  • #24: When we want to build the page for a given user, we just need the rows corresponding to them.
  • #27: As you can see, under this set up, each of the application servers will talk to some set of databases depending on what users they are servicing. When writes happen, they will need to be broadcast to every shard that contains a user in the appropriate group.
  • #29: Hopefully it’s been ~15-20 minutes. The cost for that lower operational overhead is that Accumulo is going to make us think about our data organization more.
  • #30: The headline description from the project itself. We’ll break down the pieces of this description and how they end up easing the pain in our current scaled up application.
  • #31: As you can see, under this set up, each of the application servers will talk to some set of databases depending on what users they are servicing. When writes happen, they will need to be broadcast to every shard that contains a user in the appropriate group.
  • #32: It’s important that we start with the fundamental limitation of Accumulo: it’s a key value store and does not provide a relational model.
  • #33: You read and write values given a particular key.
  • #34: Keys are made up of a row, a column, and timestamp.
  • #35: A Column, in turn, is actually made up of 3 parts. A family, or general grouping of similar columns, a qualifier that specifies which coordinate within the family, and a visibility. We’ll cover how some of these key-parts are treated specially in a little bit. Generally, you can just think of it like a big multi-dimensional map.
  • #37: We’ll cover this last bit more in a few minutes.
  • #38: This is our read-oriented de-normalized conversation table.
  • #39: When we want to build the page for a given user, we just need the rows corresponding to them. So we’ll take the cell-per-record approach, and use the reader id as a shard indicator in the row id.
  • #41: Mind you, this is just a first pass.
  • #42: Note that we’d set each cell’s visibility to be the group the message went to.
  • #43: If storage is at a premium, we can handle deleting cells we know a user won’t see in an offline way.
  • #44: Because Accumulo only deals with cells at its core, it doesn’t presume that a column being present in one row means it will be present in another. It stores nothing when a column doesn’t exist. This means we can have extremely wide tables that are only sparsely populated; perfect for the fan-out of our cat conversations.
  • #45: Accumulo asserts a total ordering across all keys.
  • #46: Sort is done key-component wise with decreasing priority across: row, family, qualifier, visibility, and finally time.
  • #48: A common difficulty for building on Accumulo is that you need an increased awareness of how parts of Accumulo will interact with your chosen data layout. Rather than something you can reason about once there are issues (like adding an index to a RDBMS), you need to work it out at the time of application design. To be performant, we need to make sure that our access pattern for a given user will be a small number of these sequential ranges. That means we have to understand how our chosen keys lay out for Accumulo scans. This layout and scan makes me think of two issues for our application.
  • #49: Out of the box, Accumulo will give you lexicoders for all the primitive types as well as Java String, Date, and BigInteger objects. It will also let you build a sortable representation of a list of encoded values.
  • #50: This is how we’re laying things out again. First entry is always just the image; later entries never need the image because they got it at the start of the scan. Recall earlier when I mentioned that Accumulo just treats the bytes as-is and it’s common for applications to use multiple meanings for a cell value depending on the column. So let’s remove the placeholders in our values and instead make it explicit when a cell is the image for the start of a conversation and when it’s a comment.
  • #51: By default, Accumulo will only keep a single version of a given key around; it decides which one to keep based on whichever is newest according to the timestamp in the key. To simplify our current data model, we’re going to configure it to keep an arbitrary number of versions. This will allow us to leave the “comment order” out of our key entirely. We can either set the time based on the posting client or we can rely on the order Accumulo receives updates. We’d always receive them when reading most-recent-first.
  • #52: We’ve complicated the mapping from our original database. We’re relying on the way scans work in Accumulo to simplify how we interact with our dataset.
  • #53: By relying on the timestamp and multiple cell versions, we do end up with most-recent-first ordering on comments. On the downside, we’ll have to reverse for display. On the plus side, we can easily do things like previews of most-recent-comment.
  • #54: At its simplest, this just means that Accumulo will scale across many machines. Unlike our manual database sharding, this should be transparent to you.
  • #55: Our diagram is a bit of an oversimplification.
  • #56: We can add in a bit of detail. The requests from our clients are going to be served within the cluster by a set of Tablet Servers. Unfortunately, it won’t make much sense to talk about them without first going back to our data model for a second.
  • #58: When I said earlier that we’ll treat the row like a database shard, I wasn’t just talking for our application. Internally, Accumulo manages cells in groups of rows.
  • #59: Practically, this means that the row is the atomic unit of parallelizability within an Accumulo system. In our case, we don’t expect one user to be in so many other people’s cat conversation groups that a single machine couldn’t handle their stream. In other use cases we may have to account for this in our key design.
  • #61: If you look closely, you can see the tablets!
  • #62: In particular, this means that we should probably use a hash on our user ids to ensure we don’t get a contiguous block of group members all going to the same server. Besides having to know about the contiguous group members issue, we don’t need to embed any other knowledge about the way sharding is handled into our application.
  • #63: Not having the logic in our application server also means that maintenance tasks like expanding our cluster are easier.
  • #64: Accumulo is horizontally scalable and tested at very large cluster sizes.
  • #65: Adding new hardware resources is equivalent to adding a new shard.
  • #66: Hard engineering work means expensive.
  • #67: That’s it. Once the server comes online, Accumulo’s internal coordination service will recognize that there are more physical resources available on the cluster and safely migrate Tablets from busier servers over to the new one.
  • #68: Accumulo has no single point of failure and safely recovers from partial failures.
  • #69: This is the CAP theorem, in brief. You can’t choose to give up “partition tolerance.”
  • #70: This is the CAP theorem, in brief. You can’t choose to give up “partition tolerance.”
  • #71: We had some fault tolerance. If fortunate we had automatic failover. If _very_ fortunate we had those without data loss. If we lost more nodes in a particular shard than we had replication set up for, that set of users was just out of luck until someone got paged. Whether this storage system favored availability or consistency is very implementation dependent. Most that I have dealt with chose availability because the replication was not synchronous.
  • #73: Remember zooming in here?
  • #75: Can’t write directly to persistent storage, because that’s all sorted.
  • #76: First, we write to a distributed write ahead log. These logs are append-only and written to other nodes via an underlying distributed file system. They are only used in the event that the node fails before we can update persistent storage.
  • #77: Once we are assured that there is a safe copy for recovery, we write the update into our buffer of accepted writes.
  • #78: Then we ack the client and the write is visible to the world.
  • #79: Now it’s possible that after that ack we’ll have a failure.
  • #80: Like this.
  • #81: After a tunable timeout, there’s a coordinator system that will notice the node is down.
  • #82: It will have the remaining Tablet Servers load the write ahead logs from the down server and
  • #83: When they’re done, the Tablets from the down server will be reassigned. This assignment is a lightweight RPC. It just tells the Tablet Server to take ownership of the Tablets, perform any recovery out of distributed storage, and then serve client requests.
  • #84: Can’t write directly to persistent storage, because that’s all sorted.
  • #85: In order to keep the size of write ahead logs down, the Tablet Server occasionally flushes buffered writes out into newly sorted files on persistent storage.
  • #86: Now that we’ve covered the internals of recovery, we can see that in addition to easier migration paths, we also have better robustness guarantees because our shards will move themselves around as failures occur, allowing for a more graceful degradation in the face of failures.
  • #87: What are the big gaps still?
  • #88: In open source. There’s a private company that has modified Presto. Replication is in the release currently named 1.7.0.
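
Notes #45, #46, and #51 describe a total ordering over keys (row, then family, qualifier, visibility, and finally time) with the most recent timestamp read first. A minimal sketch of that ordering follows. It is plain Python and does not use the Accumulo client API; the `Key` tuple, the `scan` helper, and the row, family, and qualifier names are hypothetical, and Python string comparison stands in for Accumulo's byte-wise comparison (the two agree for the ASCII values used here).

```python
from collections import namedtuple

# Hypothetical stand-in for a key: row, column parts, and timestamp.
Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_order(key):
    # Compare component-wise with decreasing priority; negate the
    # timestamp so that newer cells sort ahead of older ones.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def scan(cells, row):
    """Return the (key, value) pairs of one row in sorted order."""
    in_row = [cell for cell in cells if cell[0].row == row]
    return sorted(in_row, key=lambda cell: sort_order(cell[0]))


# One reader's stream: an image cell that starts a conversation, plus
# several versions of the message cell (one per comment).
cells = [
    (Key("user_b", "conv1", "message", "group1", 10), "so fluffy"),
    (Key("user_a", "conv1", "message", "group1", 10), "so fluffy"),
    (Key("user_a", "conv1", "message", "group1", 30), "what a grump"),
    (Key("user_a", "conv1", "image", "group1", 5), "mr-mean.jpg"),
    (Key("user_a", "conv1", "message", "group1", 20), "agreed"),
]

values = [value for _, value in scan(cells, "user_a")]
print(values)  # ['mr-mean.jpg', 'what a grump', 'agreed', 'so fluffy']
```

With these made-up qualifier names the image cell sorts ahead of the comments, and the comments come back most-recent-first, so the application would reverse them for display as note #53 mentions.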