Stock Market Order Flow Reconstruction Using HBase on AWS
Aaron Carreras, HBaseCon, San Francisco, May 2015
About the Presenter
•  Director of Enterprise Data Platforms at FINRA
•  Data Ingestion, Processing and Management
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
WHAT DO WE DO?
•  Collect and Create
o  33B events/day
o  18 national exchanges
o  Equities, Options and Fixed Income
o  Reconstruct the market from trillions of events spanning years
•  Detect & Investigate
o  Identify market manipulations, insider trading, fraud and compliance violations
•  Enforce & Discipline
o  Ensure rule compliance
o  Fine and bar broker-dealers
o  Refer matters to the SEC and other authorities
[Diagram labels: TRF, Firm, Exchange, Dark Pool]
What a stock trade looks like to the investor
Example of what is actually happening
Ingest/Access Patterns
Configurations/Approaches in Common
Logical Architecture
CDH 4.5; HBase 0.94.6; EC2 hs1.8xlarge: 16 vCPU, 117 GiB memory, 24 drives x 2,000 GB
Row Key Design & Pre-splitting
•  Salt our row keys
o  Our “natural” keys are monotonically increasing
o  Row Key = salt(PK) + PK
•  Pre-split
o  Better control of distribution of data across regions
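The salting and pre-splitting idea can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the bucket count, the MD5-based salt, the two-digit prefix format and all function names are assumptions, since the slide states only that the row key is salt(PK) + PK.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical bucket count; the slide does not give one


def salt(pk: str, buckets: int = NUM_BUCKETS) -> str:
    """Derive a stable two-digit salt from the primary key itself (buckets <= 100)."""
    digest = hashlib.md5(pk.encode("utf-8")).digest()
    return f"{int.from_bytes(digest[:4], 'big') % buckets:02d}"


def row_key(pk: str, buckets: int = NUM_BUCKETS) -> str:
    """Row key = salt(PK) + PK, so consecutive PKs land in different key ranges."""
    return salt(pk, buckets) + pk


def split_points(buckets: int = NUM_BUCKETS) -> list:
    """One region per salt bucket: a boundary at every salt prefix except the first."""
    return [f"{b:02d}" for b in range(1, buckets)]


# Monotonically increasing natural keys no longer sort next to each other.
for pk in ("20150501000000001", "20150501000000002", "20150501000000003"):
    print(row_key(pk))
```

Because the salt is computed from the key, a reader can recompute the full row key for a point lookup without an extra index, and the split points can be passed to table creation so every region exists before the first load.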
Compactions & Splitting Configurations
| Parameter | Default | Override |
|---|---|---|
| hbase.hregion.majorcompaction | 7 days | 0 (disabled) |
| hbase.hstore.compactionThreshold | 3 | 10 |
| hbase.hstore.compaction.max | 10 | 15 |
| hbase.hregion.max.filesize | 10 GB | 200 GB |
| RegionSplitPolicy | IncreasingToUpperBoundRegionSplitPolicy | ConstantSizeRegionSplitPolicy |
| hbase.hstore.useExploringCompation | false | true |
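As a rough sketch, the overrides above could be rendered into an hbase-site.xml fragment like this. Several details are assumptions rather than statements from the slides: the split-policy property key and fully qualified class name (the slide gives only the policy name), and the reading of "200 GB" as 200 GiB expressed in bytes. The last property name is spelled as it appears on the slide.

```python
import xml.etree.ElementTree as ET

GIB = 1024 ** 3  # assumption: the slide's "GB" is read as GiB

# Override column of the table above, keyed by property name.
OVERRIDES = {
    "hbase.hregion.majorcompaction": "0",
    "hbase.hstore.compactionThreshold": "10",
    "hbase.hstore.compaction.max": "15",
    "hbase.hregion.max.filesize": str(200 * GIB),
    # Assumed property key and class name for the split policy.
    "hbase.regionserver.region.split.policy":
        "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy",
    "hbase.hstore.useExploringCompation": "true",
}


def render_site_xml(props: dict) -> str:
    """Build a Hadoop-style <configuration> document from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(render_site_xml(OVERRIDES))
```

With time-based major compactions disabled and a constant-size split policy with a large maximum file size, compaction and splitting become events the operator schedules rather than ones that fire during a load.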
OS Configuration Considerations
•  Some of these may not be relevant to you, depending on your OS and version, but are worth confirming
| Parameter | Setting |
|---|---|
| redhat_transparent_hugepage/defrag | never |
| nofile/nproc ulimit | 32768 |
| tcp_low_latency | 1 (enabled) |
| vm.swappiness | 0 (disabled) |
| selinux | Disabled |
| IPv6 | no (disabled) |
| iptables | off/stop |
Other Hadoop Configuration Considerations
| Where | Parameter | Setting |
|---|---|---|
| core-site.xml | ipc.client.tcpnodelay | true |
| core-site.xml | ipc.server.tcpnodelay | true |
| hdfs-site.xml | dfs.client.read.shortcircuit | true |
| hdfs-site.xml | fs.s3a.buffer.dir | [machine specific] |
| hbase-site.xml | hbase.snapshot.master.timeoutMillis | 1800000 |
| hbase-site.xml | hbase.snapshot.master.timeout.millis | 1800000 |
| hbase-site.xml | hbase.master.cleaner.interval | 600000 (ms) |
Use Case ‘A’: Patterns
Use Case ‘A’: Background
•  Create graphs for historical market event data (trillion records)
•  Basically a batch process
o  Each batch had ~4 billion events
o  Related events may span batches (e.g., root could arrive later, children may be corrected, etc.)
•  Back-process the prior 18 months (540 batches)
•  Complete the project given the and
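The batch count and the "trillion records" scale on this slide are consistent with each other. A quick check, assuming one batch per day and 30-day months (the slides state the totals but not how they were derived):

```python
MONTHS = 18
BATCHES_PER_MONTH = 30            # assumption: one batch per day, 30-day months
EVENTS_PER_BATCH = 4_000_000_000  # "~4 billion events" per batch, from the slide

batches = MONTHS * BATCHES_PER_MONTH
total_events = batches * EVENTS_PER_BATCH

print(batches)                # 540
print(f"{total_events:.2e}")  # 2.16e+12
```

So the 18-month back-processing run covers roughly two trillion events.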
Use Case ‘A’: Utilize Bulk Loads
•  Back processing and the ongoing update process are 100% bulk HFile loads
•  Our column families and processing align with this approach: linkage and content are split into separate column families
•  Eliminate Puts completely, along with the WAL writes, memstore flushes, and additional compactions that often accompany them
[Diagram: HFile Bulk Load]
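A bulk load works best when each HFile holds sorted cells for a single column family and falls inside a single region. The sketch below shows that preparation step on plain tuples. It is only an illustration: the function name, the tuple layout and the sample data are hypothetical, and in a real job this grouping is normally done by the MapReduce partitioner and sort rather than by hand.

```python
from bisect import bisect_right
from collections import defaultdict


def plan_hfiles(cells, split_points):
    """Group (row, family, qualifier, value) cells into one sorted run per
    (region index, column family)."""
    runs = defaultdict(list)
    for row, family, qualifier, value in cells:
        # Region i holds rows in [split_points[i-1], split_points[i]).
        region = bisect_right(split_points, row)
        runs[(region, family)].append((row, qualifier, value))
    return {key: sorted(run) for key, run in sorted(runs.items())}


cells = [
    ("15b", "content", "q", "v1"),
    ("05a", "linkage", "parent", "x"),
    ("15a", "content", "q", "v0"),
    ("20z", "linkage", "parent", "y"),
]
for (region, family), run in plan_hfiles(cells, ["10", "20"]).items():
    print(region, family, run)
```

Keeping linkage and content in separate families means a batch that only changes linkage produces files for that family alone.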
Use Case ‘A’: Optimize Gets
•  Used sorted / partitioned batched Gets
o  Minimize required RPC calls
o  Use sorting to make better use of the block cache
•  Allocate more on-heap memory for reads
| Parameter | Default | Override |
|---|---|---|
| hfile.block.cache.size | .4 | .65 |
| hbase.regionserver.global.memstore.upperLimit | .4 | .15 |
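The "sorted / partitioned batched Gets" idea can be sketched as below. The fixed two-character salt prefix, the batch size and the function name are assumptions for illustration; each yielded list stands for the keys that would go into one multi-get call to the HBase client.

```python
from itertools import groupby, islice

SALT_LEN = 2  # assumption: fixed-width salt prefix on every row key


def batched_gets(row_keys, batch_size=1000):
    """Yield sorted lists of row keys, one salt partition at a time,
    with at most batch_size keys per list."""
    ordered = sorted(set(row_keys))
    for _, group in groupby(ordered, key=lambda k: k[:SALT_LEN]):
        while batch := list(islice(group, batch_size)):
            yield batch


keys = ["01c", "00b", "01a", "00a", "01b", "00a"]
for batch in batched_gets(keys, batch_size=2):
    print(batch)
```

Grouping by partition keeps each call aimed at a narrow key range, and sorting within the group means neighbouring keys tend to hit the same HFile blocks while they are still in the block cache.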
Use Case ‘B’: Patterns
Use Case ‘B’: Background
•  Not a once-a-day batch process; it must process the data as it arrives
o  200+ business rules covering data validation, creating/breaking linkages, and identifying compliance issues within SLA
o  Progressively build the tree
•  The different processes required different access paths, sometimes requiring multiple copies of some portions of the data
Use Case ‘B’: Put Strategy
•  HFiles for the incremental processing didn’t fit as well here
•  Partitioned batch Puts
•  memstore vs block cache (50/50)
Use Case ‘B’: Scan
•  Scan
o  Distinct Daily tables along with a single Historical table to more naturally support the processing
o  Scan Daily tables only
o  Switched from Get to Scan for rows with millions of columns
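One plausible reason for moving from Get to Scan on very wide rows is that a scan can hand a row back in bounded chunks of columns (the Java client exposes this as `Scan.setBatch`), whereas a plain Get materializes the whole row at once. The slides do not say whether batching was the motivation, so the following is only a sketch of the chunked-consumption pattern, using an in-memory generator as a stand-in for the cells of one row; the function name is hypothetical.

```python
from itertools import islice


def scan_in_chunks(cells, batch):
    """Yield lists of at most `batch` (qualifier, value) cells without
    materializing the whole row."""
    it = iter(cells)
    while chunk := list(islice(it, batch)):
        yield chunk


# Stand-in for a row with many columns; a real row would stream from the server.
wide_row = ((f"q{i:07d}", i) for i in range(25))
for chunk in scan_in_chunks(wide_row, batch=10):
    print(len(chunk), chunk[0][0], chunk[-1][0])
```

Client memory then scales with the chunk size rather than with the number of columns in the row.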
Backup and DR
HBase Backup to S3
•  HBase ExportSnapshot to S3 didn’t really support our use case
•  Significant updates to ExportSnapshot for S3
o  Support for S3A (HADOOP-10400)
o  Remove the expensive rename operation on S3 (HBASE-11119)
[Diagram: S3]
Disaster Recovery
o  AWS provides multiple Availability Zones (AZs) in different geographic regions
o  HBase snapshots are backed up to S3 and to a separate cluster in a different AZ
o  S3 buckets are backed up from one region to another for cross-region redundancy
Running Hadoop on AWS: Lessons Learned
Running Hadoop on AWS
•  S3
o  For now at least, s3a is probably the file system implementation you want to use (if you are not using EMR)
o  Rename on S3 is a copy followed by a delete rather than a metadata-only operation, and is therefore expensive
o  Eventual consistency should be accounted for
o  Consider turning S3 versioning on
•  Instance Types / Topology
o  The number of virtual instances on a single physical host impacts fault tolerance
o  Tradeoff between network performance and availability/capacity
•  Region - Availability Zone - Placement Group
o  Be aware that Availability Zone identifiers are intentionally inconsistent across accounts
Questions?

