| HBaseCon 2016 | May 24, 2016
Rolling Out Apache HBase
for Mobile Offerings at Visa
Partha Saha
pasaha@visa.com
CW Chung
cchung@visa.com
What this talk is about – A choice of NoSQL at Visa
• Scale: over 100 billion rows as history from most recent
• Speed: millisecond response times for write/read
• Real-time: data loaded in real time
An example of a mobile offering
1. Add card to wallet
2. Pay for purchase
3. See your transaction right away, along with recent history (need NoSQL here)
We chose HBase as a NoSQL solution.
We built a scalable, real-time Transaction History Service.
We migrated prominent mobile wallet offerings to the Service.
This talk is about what we learned over the last year.
This talk …
1. We assume some knowledge of and familiarity with HBase.
2. We used HBase 1.0.0 with Cloudera Distribution CDH 5.4.3, so our observations are based on that version of HBase.
3. We cover the important learning events along the way of adopting HBase at Visa.
   a. These can help new teams adopting HBase avoid the same pitfalls.
   b. Our learning continues as we take on more interesting and challenging opportunities.
Is YCSB a good way to compare NoSQL options?
It is actually not…
• Not unless you know how to configure each NoSQL option for optimal performance.
• Otherwise you may be driven to another solution because its performance seems “smoother” and is easier to explain with rudimentary knowledge.
[Figure: two time-series charts, legend "Series2"; vertical axes 0 to 60000 and 0 to 40000, horizontal axes 1 to 309 and 1 to 144.]
• It is, however, a great tool for observing how system configuration changes affect performance, and for exploring the configuration space for various workloads.
Our YCSB experience…
• Very easy to set up!
• We got a baseline of HBase performance on the cluster, and reran it after significant configuration and application code changes.
• Key parameters used:
– # of client threads
– # of operations
– # of records in the data set
– Workload mix of read/update/insert (we added a 100% insert/update workload)
• We used a bash driver script to test various combinations of parameters.
• Latency measurements can be reported as a histogram or as a time series. Both were useful.
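A parameter sweep of the kind described above might look like the following sketch. The slide mentions a bash driver script; this Python version is only an illustration and just builds the command lines. The `hbase10` binding name, the workload file paths, the column family `f`, and all numeric values are assumptions, not the authors' settings.

```python
import itertools

def ycsb_commands(threads, op_counts, record_counts, workloads, measurement="histogram"):
    """Build one `ycsb run` command line per parameter combination."""
    commands = []
    for t, ops, recs, wl in itertools.product(threads, op_counts, record_counts, workloads):
        commands.append(
            f"bin/ycsb run hbase10 -P {wl} -threads {t} "
            f"-p operationcount={ops} -p recordcount={recs} "
            f"-p columnfamily=f -p measurementtype={measurement}"  # column family is hypothetical
        )
    return commands

if __name__ == "__main__":
    # Hypothetical sweep: two thread counts crossed with two workload files.
    for cmd in ycsb_commands([8, 32], [1_000_000], [10_000_000],
                             ["workloads/workloada", "workloads/workloadb"]):
        print(cmd)
```

Generating the commands separately from running them makes it easy to review the full matrix before spending cluster time on it.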
Should you design yourself out of major compactions?
Not worth the trouble when you are starting…
• An argument can be made that if we need an “N”-day rolling look-back, we can keep daily tables that we create ahead of time and delete once they fall out of the look-back window. We can then reason about how to compact each daily file. Will that make the system operate better?
• Write amplification is a well-known problem and gets a lot of attention, but worrying about it during early design stages seemed like premature optimization.
• We thought that we could always optimize later, through rolling compactions and diurnal patterns of traffic, once the patterns of reads and writes were fully understood.
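For reference, the daily-table alternative that the first bullet weighs (and that the team chose not to pursue at the start) could be planned roughly as below. This is a sketch of the idea only; the `txn_` prefix and the naming scheme are hypothetical.

```python
from datetime import date, timedelta

def daily_table_plan(today, lookback_days, prefix="txn_"):
    """Return (tables in the window, table to pre-create, table to drop)."""
    def name(d):
        return f"{prefix}{d:%Y%m%d}"
    keep = [name(today - timedelta(days=i)) for i in range(lookback_days)]
    create = name(today + timedelta(days=1))            # created before it is needed
    drop = name(today - timedelta(days=lookback_days))  # just fell out of the window
    return keep, create, drop
```

Dropping a whole table avoids compacting away expired rows, which is the appeal of the design, at the cost of queries that must fan out across N tables.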
Does your design need transactional support?
We analyzed our secondary and primary key reads/writes.

Primary key    Fact
pk1
pk2

Secondary key  Associations
sk1            {pk1}
sk2            {pk1, pk2}

Operations: query keys for facts; register associations.
• By tracing reads and failures through updates, we concluded that inconsistencies were short-lived.
• Otherwise we would have used a transaction support library.
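The two-table layout above can be modeled in memory to see why a missing transaction layer only produces short-lived inconsistencies. This is a sketch under assumptions: the class and method names are hypothetical, and the real tables live in HBase rather than in dictionaries.

```python
class TwoTableStore:
    """In-memory stand-in for a fact table plus a secondary-key association table."""

    def __init__(self):
        self.facts = {}          # primary key -> fact
        self.associations = {}   # secondary key -> set of primary keys

    def register_association(self, sk, pk):
        self.associations.setdefault(sk, set()).add(pk)

    def write_fact(self, pk, fact):
        self.facts[pk] = fact

    def query(self, sk):
        """Resolve sk to primary keys, then return the facts that are visible so far."""
        pks = sorted(self.associations.get(sk, ()))
        return {pk: self.facts[pk] for pk in pks if pk in self.facts}
```

A reader that arrives between the two writes simply sees fewer facts, and sees the full set on the next read once the second write lands.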
How do you learn HBase hands-on without going into production?
We built a Continuous Integration and Learning Environment
[Diagram components: git/Stash, Bamboo, Artifactory Client, Chef Client, Build Server, Test Server]
Bamboo plan:
- Checkout
- Build
- Upload
- Deploy
- Run test
How do you get Operations ready for HBase in production?
We allocated one developer for 1 day/week to monitor production problems …
Sites: Bangalore, India and Foster City, CA, USA
1. We shadowed real production.
2. Any production problem was given priority by the whole team.
3. We used 2 sites for 24x7 eyes.
4. We added alerting and monitoring dashboards.
5. We launched only when we met certain metrics.
Loading data in real-time as it is read
We used a micro-batch approach
[Diagram: streams (stream 1, stream 2, ….. stream N) are tailed and fed through a Pre-Processor into two pipelines.]
• Loader pipeline: Listing & Sender, Tracker, Loader Master (1 per stream); Receiver, Loader Worker (1..N per master). Each worker: Batch Processor, LLF Reader or Stream Reader, HBase Load.
• Notification pipeline: Listing & Sender, Tracker, Notification Master (1 per stream); Receiver, Notification Worker (1..N per master). Each worker: Batch Processor, LLF Reader or Stream Reader, HBase Registration Query, Send Notification.
• Masters and workers are connected by IPC.
• Micro-batch (250 ms) control and state files: reads and writes.
We had to build an approach to remember and retry from any point in each stream.
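One way to "remember and retry from any point" is to persist a byte offset per stream and advance it only after a batch has been loaded. The sketch below assumes newline-delimited records in a local file and a JSON state file; `process_stream`, `load_batch`, and the file layout are hypothetical, and the authors' control and state file format is not described in the slides.

```python
import json
import os

def process_stream(stream_path, state_path, load_batch, batch_size=100):
    """Load complete lines in micro-batches, resuming from the last saved offset."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f)["offset"]
    with open(stream_path, "rb") as stream:
        while True:
            stream.seek(offset)
            batch, end = [], offset
            for _ in range(batch_size):
                line = stream.readline()
                if not line.endswith(b"\n"):   # end of file or a half-written record
                    break
                batch.append(line[:-1].decode("utf-8"))
                end = stream.tell()
            if not batch:
                return offset
            load_batch(batch)                  # if this raises, the saved offset is untouched
            tmp = state_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"offset": end}, f)
            os.replace(tmp, state_path)        # atomic swap of the state file
            offset = end
```

A caller would invoke this on a short timer (the slide mentions 250 ms). Because the offset moves only after a successful load, a failed batch is retried as a whole, so the load step should be idempotent.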
Reading via Web Servers
The web-services Front End
[Diagram components: Gateway and Load Balancer; Web Service Wrapper; Rest Controller; API; Request Transform; Response Transform; Domain Objects; Audit Listener; BusinessComponent; DataService; HBaseAPI; HBase Plugin; HBase Cluster; Config Service; Access Authorization; Encryption Utility; Audit; Audit DB; MQ; Load Distribution Plugin; Cache; Subscription Service; Failover Service]
Availability
We used 2 data centers to get availability
[Diagram: Data Center 1 and Data Center 2, each with its own streams; non-native streams are replicated between them.]
We use shadow tables to hold the writes meant for the other data center while it is down, and drain the shadow tables so that it can catch up.
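The shadow-table idea can be sketched in memory as follows. This is an illustration under assumptions: `DataCenterWriter` and its fields are hypothetical, dictionaries stand in for HBase tables, and the slides do not say how the real drain is ordered or triggered.

```python
class DataCenterWriter:
    """Writes locally and to the peer; buffers peer writes in a shadow list while the peer is down."""

    def __init__(self):
        self.local = {}
        self.peer = {}
        self.shadow = []      # ordered (key, value) writes the peer has missed
        self.peer_up = True

    def put(self, key, value):
        self.local[key] = value
        if self.peer_up and not self.shadow:
            self.peer[key] = value
        else:
            self.shadow.append((key, value))   # keep ordering until the backlog is drained

    def drain(self):
        """Replay buffered writes to the peer in order; return how many were applied."""
        if not self.peer_up:
            return 0
        applied = len(self.shadow)
        for key, value in self.shadow:
            self.peer[key] = value
        self.shadow.clear()
        return applied
```

New writes keep going to the shadow buffer until it is empty, so the peer never sees a newer value overwritten by an older buffered one.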
Learning your Data Center clock
HBase is sensitive to clock skew…
• Kerberos services do not tolerate more than a few minutes of clock skew.
• Small skews generate warnings; large skews kill region servers.
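The warn-versus-kill behavior can be expressed as two thresholds. The sketch below assumes the stock values of `hbase.master.warningclockskew` (10 s) and `hbase.master.maxclockskew` (30 s); these defaults are stated from memory of the HBase configuration and should be checked against your distribution. The function name is hypothetical.

```python
def classify_skew(skew_ms, warn_ms=10_000, max_ms=30_000):
    """Classify the time difference between a region server and the master."""
    skew = abs(skew_ms)
    if skew > max_ms:
        return "reject"   # the master refuses the region server, which then shuts down
    if skew > warn_ms:
        return "warn"     # logged as a warning only
    return "ok"
```

Kerberos applies its own, separate tolerance (commonly five minutes), so a host can pass one check and still fail the other.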
Client retries
Client retries & IOExceptions
• Default HBase timeout/retry settings can take tens of minutes to time out:
– hbase.rpc.timeout: 60 sec
– hbase.client.retries.number: 35
– hbase.client.pause: 100 msec (grows to 10 sec quickly with back-off)
– Longer still once potential retries by ZooKeeper are factored in!
– See the blogs by Lars Hofhansl: “HBase Client timeouts”, “HBase client response times”
• We chose a fail-fast strategy, since the end-user device does an end-to-end retry.
• Our timeout/retry settings: 1 sec timeout, 3 total tries.
– Works well within the same data center, as well as across data centers
• However, once in a while, clients see IOExceptions!
– Caused by the Region Server (busy in GC, major/minor compaction, … ?)
– Or the network?
– Or the client itself?
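A back-of-the-envelope bound shows where "tens of minutes" comes from. The sketch assumes every try runs to the full RPC timeout and that pauses follow a fixed multiplier table; the table below is assumed to match `HConstants.RETRY_BACKOFF` in HBase 1.0 and should be verified against your version. Jitter and ZooKeeper retries are ignored, and the function name is hypothetical.

```python
# Multipliers applied to hbase.client.pause on successive retries (assumed values).
RETRY_BACKOFF = [1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200]

def worst_case_ms(tries, rpc_timeout_ms, pause_ms, backoff=RETRY_BACKOFF):
    """Upper bound: every try times out, with a backed-off pause between tries."""
    pauses = sum(pause_ms * backoff[min(i, len(backoff) - 1)] for i in range(tries - 1))
    return tries * rpc_timeout_ms + pauses

if __name__ == "__main__":
    print(worst_case_ms(35, 60_000, 100) / 60_000, "minutes with the defaults")
    print(worst_case_ms(3, 1_000, 100) / 1_000, "seconds with fail-fast settings")
```

Under these assumptions the defaults come to roughly 43 minutes, while 3 tries with a 1 second timeout (and the default 100 msec pause) come to 3.3 seconds.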
Correlating client exceptions
Correlating client exceptions
• Client side:
– Turn on HBase client debugging:
• log4j.logger.org.apache.hadoop.hbase.client=DEBUG
• log4j.logger.org.apache.hadoop.hbase.ipc=DEBUG
– Catch the exceptions and print the specific Region Server name:
• IOException, RetriesExhaustedWithDetailsException
• Server side:
– Then look into the log of that specific Region Server.
• Works well when you know the specific server causing the IOExceptions.
– What if you do not?
Correlating client exceptions
• Build Root Cause Analysis software to:
– Collect the relevant logs from the sources:
• Client: application logs, HBase client logs, GC logs
• Hadoop servers: HBase, HDFS, ZooKeeper server and GC logs
• Cluster events: Cloudera Manager API
• Other logs: KDC logs, Kerberos canary, network latency monitoring
– Parse the logs (single-line and multi-line text, JSON, XML) into CSV files.
– Normalize date and time formats; apply date and time range filtering.
– Apply text filtering and text reduction to verbose lines.
– Output: an events CSV, sorted by time and server, suitable for grep/awk/sort and Hive/SQL.
• Quickly get a complete view of the sequence of events across the various services.
• Sometimes this identifies the smoking gun (e.g. an exception caused by GC).
• Still useful in the few cases when no smoking gun can be found!
– Troubleshooting is also a process of elimination.
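The normalize, filter, and sort steps might be sketched as below. This is not the authors' tool: the two timestamp layouts (a log4j-style line and a JDK GC date stamp), the 200-character truncation, and all function names are assumptions, and time zones are ignored for brevity.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical timestamp layouts; a real tool needs one pattern per log source.
PATTERNS = [
    (re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (.*)$"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d{3})\S* (.*)$"), "%Y-%m-%dT%H:%M:%S"),
]

def parse_line(server, source, line):
    """Return (timestamp, server, source, message), or None if no timestamp is found."""
    for pattern, fmt in PATTERNS:
        m = pattern.match(line)
        if m:
            ts = datetime.strptime(m.group(1), fmt).replace(microsecond=int(m.group(2)) * 1000)
            return ts, server, source, m.group(3)[:200]   # crude reduction of verbose lines
    return None   # continuation lines of multi-line entries are dropped in this sketch

def to_events_csv(records, start, end):
    """records: iterable of (server, source, line). Returns CSV text sorted by time, then server."""
    events = [e for e in (parse_line(*r) for r in records) if e and start <= e[0] <= end]
    events.sort(key=lambda e: (e[0], e[1]))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time", "server", "source", "message"])
    for ts, server, source, msg in events:
        writer.writerow([ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3], server, source, msg])
    return out.getvalue()
```

Putting every source on one timeline is what lets a client-side IOException be lined up against, say, a long GC pause on a specific region server a few milliseconds earlier.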
Kerberos Gotchas
Kerberos Gotchas – what we have learned
• Hostnames use the FQDN (Fully Qualified Domain Name, like server123.abc.com).
• Use TCP rather than UDP (set udp_preference_limit = 1 in krb5.conf).
• KDC (MIT Kerberos) server:
– Configure it to start several KDC worker processes to handle bursty traffic (use the -w option).
– Set up a backup KDC for higher availability.
• Debugging tips:
– $ export KRB5_TRACE=/dev/stderr (or a file)
– JVM option: -Dsun.security.krb5.debug=true
• Kerberos support is built into the Java JRE, using internal classes:
– Oracle JDK: com.sun classes; IBM JDK on AIX: com.ibm classes
– Hadoop is built and tested against the Oracle JDK (mileage on the AIX JDK varies).
• Good references (besides the usual documents on Kerberos, and the HBase user mailing list):
– Steve Loughran: Hadoop and Kerberos: The Madness beyond the Gate.
– HBase and Hadoop common source code: UserGroupInformation.java.
Kerberos Gotchas – what we learned
– Renewing a TGT (Ticket Granting Ticket)
• After a successful kinit, the application principal gets a Kerberos TGT.
• By default, the TGT is good for 10 hours.
• For long-running applications, 10 hours is obviously not enough: the TGT needs to be renewed.
• Initially we used a process/thread to run kinit once every few hours.
– Still ran into some IOExceptions at the time of TGT renewal.
– Not the recommended way for long-running applications.
• Now we use the UGI API (UserGroupInformation): loginUserFromKeytab( ).
– Does not require a separate process/thread to do TGT renewal.
– The Hadoop/HBase client class library catches the exception caused by TGT expiration and calls reloginFromKeytab( ) to renew the TGT automatically.
– Also considering spawning a thread to proactively invoke checkTGTAndReloginFromKeytab( ).
– Ongoing investigation: the client occasionally still experiences a momentary IOException around the time of ticket renewal.
– Referral tickets: when one realm is set up to trust another realm, be aware of the additional KDC calls that result when the kinit principal is from the trusted realm.
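The proactive check being considered above amounts to asking, on a timer, whether the ticket is far enough into its lifetime to be worth renewing. The sketch below shows only that decision, with times as epoch seconds. The 80% threshold is an assumption modeled on what Hadoop's UserGroupInformation is understood to use, and `should_relogin` is a hypothetical name.

```python
def should_relogin(now, ticket_start, ticket_end, renew_window=0.8):
    """True once `now` is past the given fraction of the ticket's lifetime."""
    refresh_at = ticket_start + renew_window * (ticket_end - ticket_start)
    return now >= refresh_at
```

With a 10 hour ticket this triggers after 8 hours, leaving a 2 hour margin in which a failed renewal can be retried before requests start to fail.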
Garbage Collection
Garbage Collection
• Use G1 on Oracle JDK 1.8.
• Basically using the settings recommended by folks at HBaseCon 2015.
– By Eric Kaczmarek, Yanping Wang, Liqi Yi
• Set the target GC pause to 100 msec; Young Gen to ~1GB.
• Observations consistent with their published results:
– Observed GC time in production:
• 100 msec or less: 67%
• 400 msec or less: 99.8%
• It is important to track the actual production GC time, as the production and test clusters show somewhat different distributions.
GC Duration comparison: production vs perf cluster
GC: How Good is MaxGCPauseMillis as a Target?
(MaxGCPauseMillis = 100; all GC times in msec)

Metric                                     Production cluster           Test cluster
# of GC events                             165192                       199883
Avg / Std Dev / Max                        87.1 / 64.9 / 1530           81.9 / 37.2 / 1370
50th percentile (median)                   80                           90
95th / 99th / 99.9th / 99.99th percentile  210 / 270 / 450 / 660        120 / 140 / 510 / 780
Share at or below 100 / 200 / 300 / 400    67% / 95% / 99.4% / 99.8%    85% / 99.4% / 99.6% / 99.8%
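Statistics of this kind can be computed from a list of pause durations extracted from the GC logs. The sketch below uses the nearest-rank percentile definition, which is an assumption; the slides do not say which definition or tool produced the figures above, and the function names are hypothetical.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile (p in 0..100) of an ascending list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def share_at_or_below(values, threshold):
    """Fraction of values that are <= threshold."""
    return sum(v <= threshold for v in values) / len(values)

def summarize(pauses_ms, target_ms=100):
    """Summary of GC pauses relative to a MaxGCPauseMillis-style target."""
    ordered = sorted(pauses_ms)
    return {
        "count": len(ordered),
        "median": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
        "share_within_target": share_at_or_below(ordered, target_ms),
    }
```

Reporting the share of pauses within the target alongside the tail percentiles makes it clear that MaxGCPauseMillis is a goal the collector aims for, not a guarantee.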
In Conclusion…
Adopting an open source product is a journey…
• Learning from previous adoption successes is crucial. If a use case has not been tried, analyzed, or written about before, chances are we will have to pay for the learning, and having alternate choices is a good idea.
• Making only one major technology change at a time is always a good idea.
• Setting appropriate expectations through team members and agile processes is important.
• Going into a production scenario early, as a shadow, and learning through frequent releases is helpful.
• We believe extra capacity for peak workloads was very helpful.
• Having the source code is very useful for learning and troubleshooting.
It Takes a Village! Thank you!
Alexandr Peyko
Amit Sharma
Anthony Chu
Arindam Chakraborty
Artem Savinov
Aviral Agarwal
Bala Saravanan Kannan
Ben Crane
Carl Duque
Chetan Talanki
Debasis Mullick
Deepankar Palit
Hong Zhu
Igor Karpenko
Igor Peller
Igor Ulianitski
Jay Gardner
Jim Gordon
Karthikeyan Manickavasagan
Liang Gao
Murali Reddy
Nandakumar Jayakumar
Nimish Shah
Peter Meigs
Pradyot Sikdar
Praveen Rudraraju
Rajat Raj
Raj Merchia
Ralph Blore
Ranjan Dutta
Ricardo De Ocampo Domingo
Robert Walsh
Sabu Peter
Sam Hamilton
Sandeep Reddy
Satyaban Nandi
Soumya Das
Srijoy Aditya
Srinivas Reddy Surasani
Suchismita Nayak
Suresh Pulikara
Ujjwal Kumar
Vikash Talanki
Vinay Sarda
Waqar Hasan
Winnie Chau
Xuepeng (Hans) Li
Yanyan Hao
Yusuf Rahaman
Amandeep Khurana
Jeongho Park
Jugoslav Djajic
Justin Hayes
Michael Stack

More Related Content

PPTX
Time-Series Apache HBase
PDF
Argus Production Monitoring at Salesforce
PPTX
HBaseCon 2015: HBase Operations in a Flurry
PDF
Apache HBase in the Enterprise Data Hub at Cerner
PPTX
Update on OpenTSDB and AsyncHBase
PPTX
HBaseCon 2015: OpenTSDB and AsyncHBase Update
PDF
Tales from Taming the Long Tail
PPTX
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
Time-Series Apache HBase
Argus Production Monitoring at Salesforce
HBaseCon 2015: HBase Operations in a Flurry
Apache HBase in the Enterprise Data Hub at Cerner
Update on OpenTSDB and AsyncHBase
HBaseCon 2015: OpenTSDB and AsyncHBase Update
Tales from Taming the Long Tail
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

What's hot (20)

PPTX
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
PPTX
HBaseCon 2013: Near Real Time Indexing for eBay Search
PDF
HBaseCon2017 Highly-Available HBase
PDF
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
PDF
HBaseConAsia2018 Keynote1: Apache HBase Project Status
PPTX
Unified Batch & Stream Processing with Apache Samza
PDF
HBaseCon2017 gohbase: Pure Go HBase Client
PPTX
HBaseCon 2015: HBase as an IoT Stream Analytics Platform for Parkinson's Dise...
PDF
hbaseconasia2017: Apache HBase at Netease
PPTX
HBaseCon 2013: ETL for Apache HBase
PDF
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
PPTX
Off-heaping the Apache HBase Read Path
PDF
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
PDF
Amazon Elastic Map Reduce - Ian Meyers
PPTX
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
PPTX
Amazon aws big data demystified | Introduction to streaming and messaging flu...
PDF
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
PDF
DataEngConf SF16 - Collecting and Moving Data at Scale
PDF
Tales from the Cloudera Field
PDF
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
HBaseCon 2013: Near Real Time Indexing for eBay Search
HBaseCon2017 Highly-Available HBase
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
HBaseConAsia2018 Keynote1: Apache HBase Project Status
Unified Batch & Stream Processing with Apache Samza
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon 2015: HBase as an IoT Stream Analytics Platform for Parkinson's Dise...
hbaseconasia2017: Apache HBase at Netease
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
Off-heaping the Apache HBase Read Path
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
Amazon Elastic Map Reduce - Ian Meyers
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
Amazon aws big data demystified | Introduction to streaming and messaging flu...
HBaseConAsia2018 Track1-2: WALLess HBase with persistent memory devices
DataEngConf SF16 - Collecting and Moving Data at Scale
Tales from the Cloudera Field
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Ad

Viewers also liked (20)

PPTX
Apache HBase at Airbnb
PDF
Improvements to Apache HBase and Its Applications in Alibaba Search
PDF
Apache HBase - Just the Basics
PPTX
Apache Phoenix: Use Cases and New Features
PDF
Apache HBase Improvements and Practices at Xiaomi
PDF
Solving Multi-tenancy and G1GC in Apache HBase
PDF
Breaking the Sound Barrier with Persistent Memory
PPTX
Apache HBase, Accelerated: In-Memory Flush and Compaction
PPTX
Apache Kylin’s Performance Boost from Apache HBase
PPTX
Keynote: The Future of Apache HBase
PDF
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
PPTX
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
PPTX
In Search of Database Nirvana: Challenges of Delivering HTAP
PDF
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
PPTX
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
PPTX
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
PPTX
Real-time HBase: Lessons from the Cloud
PPTX
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase
PDF
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
PDF
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
Apache HBase at Airbnb
Improvements to Apache HBase and Its Applications in Alibaba Search
Apache HBase - Just the Basics
Apache Phoenix: Use Cases and New Features
Apache HBase Improvements and Practices at Xiaomi
Solving Multi-tenancy and G1GC in Apache HBase
Breaking the Sound Barrier with Persistent Memory
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache Kylin’s Performance Boost from Apache HBase
Keynote: The Future of Apache HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
In Search of Database Nirvana: Challenges of Delivering HTAP
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
Real-time HBase: Lessons from the Cloud
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
Ad

Similar to Rolling Out Apache HBase for Mobile Offerings at Visa (20)

PPTX
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
PDF
DevOps on AWS
PDF
OpenTSDB for monitoring @ Criteo
KEY
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
PDF
Building real time data-driven products
ODP
Testing at-cloud-speed sans-app-sec-austin-2013
PDF
A step by-step process to design and manage a successful sap bi implementatio...
PPTX
HBase Backups
PDF
Enterprise Use Case Webinar - PaaS Metering and Monitoring
PDF
Architecting applications with Hadoop - Fraud Detection
PDF
23 LAMP Stack #burningkeyboards
PDF
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
PDF
Big Data Streams Architectures. Why? What? How?
PPTX
Hbase Backups: Backups in the Enterprise
PDF
Couchbase Chennai Meetup: Developing with Couchbase- made easy
PDF
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !!
PPT
Four Ways to Improve ASP .NET Performance and Scalability
PPTX
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
PPTX
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
PPTX
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
DevOps on AWS
OpenTSDB for monitoring @ Criteo
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
Building real time data-driven products
Testing at-cloud-speed sans-app-sec-austin-2013
A step by-step process to design and manage a successful sap bi implementatio...
HBase Backups
Enterprise Use Case Webinar - PaaS Metering and Monitoring
Architecting applications with Hadoop - Fraud Detection
23 LAMP Stack #burningkeyboards
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
Big Data Streams Architectures. Why? What? How?
Hbase Backups: Backups in the Enterprise
Couchbase Chennai Meetup: Developing with Couchbase- made easy
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !!
Four Ways to Improve ASP .NET Performance and Scalability
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn

More from HBaseCon (20)

PDF
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
PDF
hbaseconasia2017: HBase on Beam
PDF
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
PDF
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
PDF
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
PDF
hbaseconasia2017: HBase在Hulu的使用和实践
PDF
hbaseconasia2017: 基于HBase的企业级大数据平台
PDF
hbaseconasia2017: HBase at JD.com
PDF
hbaseconasia2017: Large scale data near-line loading method and architecture
PDF
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
PDF
hbaseconasia2017: HBase Practice At XiaoMi
PDF
hbaseconasia2017: hbase-2.0.0
PDF
HBaseCon2017 Democratizing HBase
PDF
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
PDF
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
PDF
HBaseCon2017 Transactions in HBase
PDF
HBaseCon2017 Apache HBase at Didi
PDF
HBaseCon2017 Improving HBase availability in a multi tenant environment
PDF
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...
PDF
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: hbase-2.0.0
HBaseCon2017 Democratizing HBase
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Transactions in HBase
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase

Recently uploaded (20)

PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
AI in Product Development-omnex systems
PDF
How to Migrate SBCGlobal Email to Yahoo Easily
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PDF
medical staffing services at VALiNTRY
PDF
Softaken Excel to vCard Converter Software.pdf
PPTX
history of c programming in notes for students .pptx
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PDF
Wondershare Filmora 15 Crack With Activation Key [2025
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
wealthsignaloriginal-com-DS-text-... (1).pdf
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PPTX
ai tools demonstartion for schools and inter college
PDF
Design an Analysis of Algorithms II-SECS-1021-03
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PTS Company Brochure 2025 (1).pdf.......
AI in Product Development-omnex systems
How to Migrate SBCGlobal Email to Yahoo Easily
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
medical staffing services at VALiNTRY
Softaken Excel to vCard Converter Software.pdf
history of c programming in notes for students .pptx
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
Which alternative to Crystal Reports is best for small or large businesses.pdf
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
Wondershare Filmora 15 Crack With Activation Key [2025
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Odoo Companies in India – Driving Business Transformation.pdf
wealthsignaloriginal-com-DS-text-... (1).pdf
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
ai tools demonstartion for schools and inter college
Design an Analysis of Algorithms II-SECS-1021-03

Rolling Out Apache HBase for Mobile Offerings at Visa

  • 1. | HBaseCon 2016 | May 24, 20161 Rolling Out Apache HBase for Mobile Offerings at Visa Partha Saha pasaha@visa.com CW Chung cchung@visa.com
  • 2. | HBaseCon 2016 | May 24, 20162 Data loaded in real-time Over 100 Billion rows as history from most recent Milli-second response times for write/read What this talk is about – A choice of NoSQL at Visa Scale Speed Real-time
  • 3. | HBaseCon 2016 | May 24, 20163 An example of a mobile offering Add card to wallet Pay For Purchase See your transaction Right away along with recent history Need NoSQL Here
  • 4. | HBaseCon 2016 | May 24, 20164 We chose HBase as a NoSQL solution. We built a scalable and real-time Transaction History Service. We migrated prominent Mobile wallet offerings to the Service. This talk is about our learnings over the last year.
  • 5. | HBaseCon 2016 | May 24, 20165 This talk … 1. We assume some knowledge and familiarity of HBase. 2. We used HBase 1.0.0 with Cloudera Distribution CDH 5.4.3, so our observations are based on that version of HBase. 3. We cover the important learning events along the way of adoption of HBase in Visa 1. These can help new teams adopting HBase so that they avoid the same pitfalls. 2. Our learning continues as we take on more interesting and challenging opportunities.
  • 6. | HBaseCon 2016 | May 24, 20166 Is YCSB a good way to compare NoSQL options?
  • 7. | HBaseCon 2016 | May 24, 20167 It is actually not… • Unless you know how to configure your NoSQL options for optimal performance… • You may be driven to another solution, because its performance seems “smoother” and easier to explain by rudimentary knowledge. 0 20000 40000 60000 1 12 23 34 45 56 67 78 89 100 111 122 133 144 155 166 177 188 199 210 221 232 243 254 265 276 287 298 309 Series2 0 20000 40000 1 12 23 34 45 56 67 78 89 100 111 122 133 144 Series2 • It is a great tool however to observe how system configuration changes performance, and explore the configuration space for various workloads.
  • 8. | HBaseCon 2016 | May 24, 20168 Our YCSB experience… • Very easy to set up! • Got a baseline of HBase performance of the cluster. Rerun after significant configuration & application code changes. • Key parameters used: – # of client threads – # of operations – # records in Data Set – Workload mix of read/update/insert. (We added 100% insert/update workload). – Use a bash driver script to test various combinations of parameters. • Latency measurement type can be in histogram or timeseries. Both were useful.
  • 9. | HBaseCon 2016 | May 24, 20169 Should you design yourself out of major compactions?
  • 10. | HBaseCon 2016 | May 24, 201610 Not worth the trouble when you are starting… • An argument may be made that if we need an “N” day rolling look back, we can have daily tables that we create before and delete past the look back window. We can then reason about how to compact each daily file. Will that make the system operate better? • Write amplification is a well known problem and gets a lot of attention, but however, worrying about the problem during early design stages seemed like premature optimization. • We thought that we could always optimize later through rolling compactions and diurnal patterns of traffic later once patterns of reads and writes were fully understood.
  • 11. | HBaseCon 2016 | May 24, 201611 Does your design need transactional support?
  • 12. | HBaseCon 2016 | May 24, 201612 We analyzed our secondary and primary key read/writes. Primary key Fact pk1 pk2 Seconda ry key Associations sk1 {pk1} sk2 {pk1, pk2} Query keys for facts Register associations • We concluded, by tracing reads and failures through updates that inconsistencies were short lived. • We would have used a transaction support library otherwise.
  • 13. | HBaseCon 2016 | May 24, 201613 How do you hands-on learn about HBase without going into Production?
  • 14. | HBaseCon 2016 | May 24, 201614 We built a Continuous Integration and Learning Environment Build Server git/ Stash Bamboo Artifactory Client Bamboo plan Chef Client - Checkout - Build - Upload - Deploy - Run test Test Server
  • 15. | HBaseCon 2016 | May 24, 201615 How do get Operations ready for HBase in Production?
  • 16. | HBaseCon 2016 | May 24, 201616 We allocated one developer for 1 day/week to monitor production problems … Bangalore India Foster City CA, USA 1. We shadowed the real production 2. Any production problem was given priority by the whole team 3. We used 2 sites for 24x7 eyes 4. Added Alert and Monitoring dashboards 5. We launched only when when we met certain metrics
  • 17. | HBaseCon 2016 | May 24, 201617 Loading data in real-time as it is read
  • 18. | HBaseCon 2016 | May 24, 201618 We used a micro-batch approach Pre- Processor Listing & Sender Tracker Loader Master Receiver Loader Worker Batch Processor LLF Reader HBase Load Batch Processor LLF Reader HBase Load Batch Processor Stream Reader HBase Load Listing & Sender Tracker Notification Master Receiver Notification Worker Batch Processor LLF Reader HBase Registration Query Send Notification Batch Processor LLF Reader HBase Registration Query Send Notification Batch Processor Stream Reader HBase Query Send Notification IPC IPC Micro-Batch (250 ms) Control and State Files readswrites 1 per Stream 1..N per Master 1 per Stream1..N per Master Stream N Stream 2 stream1 …..... tail We had to build an approach to remember and retry from any point in each stream
  • 19. | HBaseCon 2016 | May 24, 201619 Reading via Web Servers
  • 20. | HBaseCon 2016 | May 24, 201620 The web-services Front End Audit DB MQ Config Service Access Authorization Encryption UtilityAudit Load Distribution Plugin Cache Subscription Service Failover Service BusinessComponent DataService Web Service Wrapper Rest Controller API Request Transform Response Transform Domain Objects Audit Listener HBaseAPI HBase Plugin HBase Cluster Gateway and Load Balancer
  • 21. Availability
  • 22. We used 2 data centers to get availability
      [Diagram labels: Data Center 1 Streams; Data Center 2 Streams; Replication of non-native streams.]
      When one data center is down, the other writes that data center's updates to shadow tables. Once it is back up, the shadow tables are drained so that it can catch up.
  • 23. Learning your Data Center clock
  • 24. HBase is sensitive to clock skew…
      • Kerberos services do not tolerate more than a few minutes of clock skew.
      • Small skews generate warnings; large skews kill region servers.
  • 25. Client retries
  • 26. Client retries & IOExceptions
      • With the default HBase timeout/retry settings, a failing call can take tens of minutes to time out:
          – hbase.rpc.timeout: 60 sec
          – hbase.client.retries.number: 35
          – hbase.client.pause: 100 msec (grows to 10 sec quickly after back-off)
          – Even longer when you factor in potential retries by ZooKeeper!
          – See the blogs by Lars Hofhansl: “HBase Client timeouts”, “HBase client response times”
      • We chose a fail-fast strategy, because the end-user device does an end-to-end retry.
      • Timeout/retry settings: 1 sec timeout, 3 total tries.
          – Works well within the same data center, as well as across data centers.
      • However, once in a while, clients see IOExceptions!
          – Caused by the Region Server (busy in GC, major/minor compaction, … ?)
          – Or the network?
          – Or the client itself?
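The "tens of minutes" figure on this slide can be checked with a little arithmetic. The sketch below is only an estimate under stated assumptions: every attempt is assumed to run into the full RPC timeout, there is one back-off pause between consecutive attempts, jitter and ZooKeeper-level retries are ignored, and the multiplier table is recalled from `HConstants.RETRY_BACKOFF` in HBase 1.x and should be verified against the source of the version in use. The function names are made up for this illustration.

```python
# Multipliers applied to hbase.client.pause on successive retries.
# ASSUMPTION: recalled from HConstants.RETRY_BACKOFF in HBase 1.x; verify for your version.
RETRY_BACKOFF = [1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200]


def pause_before_retry(retry_index, pause_s):
    """Pause in seconds after failed attempt number retry_index (0-based)."""
    idx = min(retry_index, len(RETRY_BACKOFF) - 1)  # last multiplier repeats
    return pause_s * RETRY_BACKOFF[idx]


def worst_case_seconds(attempts, rpc_timeout_s, pause_s):
    """Upper bound on wall-clock time when every attempt hits the RPC timeout.

    There are attempts - 1 pauses, one between each pair of consecutive attempts.
    """
    waiting = sum(pause_before_retry(i, pause_s) for i in range(attempts - 1))
    return attempts * rpc_timeout_s + waiting


if __name__ == "__main__":
    default = worst_case_seconds(35, 60.0, 0.1)   # settings listed as defaults
    fail_fast = worst_case_seconds(3, 1.0, 0.1)   # 1 sec timeout, 3 total tries
    print(f"defaults:  {default / 60:.1f} min")
    print(f"fail fast: {fail_fast:.1f} s")
```

Under these assumptions the default settings come out at roughly 43 minutes, while the fail-fast settings stay at a few seconds (assuming the 100 msec pause is left unchanged, which the slide does not state).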
  • 27. Correlating client exceptions
  • 28. Correlating client exceptions
      • Client side:
          – Turn on HBase client debug logging:
              • log4j.logger.org.apache.hadoop.hbase.client=DEBUG
              • log4j.logger.org.apache.hadoop.hbase.ipc=DEBUG
          – Catch the exceptions and print out the specific Region Server name:
              • IOException, RetriesExhaustedWithDetailsException
      • Server side:
          – Then look into the log of that specific Region Server.
      • Works well when you know the specific server causing the IOExceptions.
          – What if you do not?
  • 29. Correlating client exceptions
      • Build Root Cause Analysis software to:
          – Collect the relevant logs from the sources:
              • Client: application logs, HBase client logs, GC logs
              • Hadoop server: HBase, HDFS, ZooKeeper server and GC logs
              • Cluster events: Cloudera Manager API
              • Other logs: KDC logs, Kerberos canary, network latency monitoring
          – Parse the logs (single-line and multi-line text, JSON, XML) into CSV files.
          – Normalize date and time formats; apply date and time range filtering.
          – Apply text filtering and text reduction on verbose lines.
          – Output: an events CSV, sorted by time and server, suitable for grep/awk/sort and Hive/SQL.
      • Quickly gives a complete view of the sequence of events across the various services.
      • Sometimes it identifies the smoking gun (e.g. an exception caused by GC).
      • Still useful in the few cases when no smoking gun can be found!
          – Trouble-shooting is also a process of elimination.
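The parse, normalize, filter and sort steps listed on this slide can be sketched as follows. This is a minimal illustration, not the tool described in the talk: the two line formats, the function names and the assumption that all timestamps are already in UTC are invented for the example, and multi-line entries such as stack traces are simply skipped.

```python
import csv
import io
import re
from datetime import datetime, timezone

# Two HYPOTHETICAL line formats, invented for illustration only.
LOG4J_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\w+) +(.*)$")
GC_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\+0000: (.*)$")


def parse_line(line):
    """Return (utc_datetime, level, message), or None if the line is not recognised."""
    m = LOG4J_LINE.match(line)
    if m:
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        # ASSUMPTION: the log was written with UTC timestamps.
        ts = ts.replace(microsecond=int(m.group(2)) * 1000, tzinfo=timezone.utc)
        return ts, m.group(3), m.group(4)
    m = GC_LINE.match(line)
    if m:
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        return ts.replace(tzinfo=timezone.utc), "GC", m.group(2)
    return None


def collect_events(sources, start, end, max_len=120):
    """sources: iterable of (server, service, lines).

    Returns rows inside [start, end], sorted by time and then by server.
    """
    rows = []
    for server, service, lines in sources:
        for line in lines:
            parsed = parse_line(line.rstrip("\n"))
            if parsed is None:
                continue  # unrecognised or continuation line
            ts, level, message = parsed
            if start <= ts <= end:
                stamp = ts.isoformat(timespec="milliseconds")
                rows.append((stamp, server, service, level, message[:max_len]))
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time_utc", "server", "service", "level", "message"])
    writer.writerows(rows)
    return out.getvalue()
```

Because every timestamp is rendered in the same ISO 8601 form, a plain string sort on the first column is also a chronological sort, which is what keeps the output friendly to grep, awk and sort.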
  • 30. Kerberos Gotchas
  • 31. Kerberos Gotchas – what we have learned
      • Use the FQDN (Fully Qualified Domain Name, like server123.abc.com) as the hostname.
      • Use TCP rather than UDP (set udp_preference_limit = 1 in krb5.conf).
      • KDC (MIT Kerberos) server:
          – Configure it to start several kdc processes to handle bursty traffic (use the -w option).
          – Set up a backup kdc for higher availability.
      • Debugging tips:
          – $ export KRB5_TRACE=/dev/stderr (or a file)
          – JVM option: -Dsun.security.krb5.debug=true
      • Kerberos support is built into the Java JRE, using internal classes:
          – Oracle JDK: com.sun classes; on IBM AIX: com.ibm
          – Hadoop is built and tested against the Oracle JDK (mileage on the AIX JDK varies).
      • Good references (besides the usual documents on Kerberos, and the HBase user mailing list):
          – Steve Loughran: Hadoop and Kerberos: The Madness beyond the Gate.
          – HBase and Hadoop common source code: UserGroupInformation.java.
  • 32. Kerberos Gotchas – what we learned – Renewing a TGT (Ticket Granting Ticket)
      • After a successful kinit, the application principal gets a Kerberos TGT.
      • By default, the TGT is good for 10 hours.
      • For long-running applications, 10 hours is obviously not enough: the TGT needs to be renewed.
      • Initially we used a process/thread to do a kinit once every few hours.
          – We still ran into some IOExceptions at the time of TGT renewal.
          – This is not the recommended way for long-running applications.
      • Now we use the UGI API (UserGroupInformation): loginUserFromKeytab( ).
          – It does not require a separate process/thread to do TGT renewal.
          – The Hadoop/HBase client class library catches the exception caused by TGT expiration and does a reloginFromKeytab( ) to renew the TGT automatically.
          – We are also considering spawning a thread that proactively invokes checkTGTAndReloginFromKeytab( ).
          – Ongoing investigation: the client occasionally still experiences a momentary IOException around the time of ticket renewal.
          – Referral ticket: when one realm is set up to trust another realm, be aware of the additional kdc calls that result when the kinit principal is from the trusted realm.
  • 33. Garbage Collection
  • 34. Garbage Collection
      • Use G1 on Oracle JDK 1.8.
      • Basically using the settings recommended by folks at HBaseCon 2015.
          – By Eric Kaczmarek, Yanping Wang, Liqi Yi
      • Set the target GC pause to 100 msec; Young Gen to ~1 GB.
      • Our observations are consistent with their published results:
          – Observed GC time in production:
              • 100 msec or less: 67%
              • 400 msec or less: 99.8%
      • It is important to track the actual production GC time, as the production and test clusters show somewhat different distributions.
  • 35. GC Duration comparison: production vs perf cluster
  • 36. GC: How Good is MaxGCPauseMillis as a Target?
      MaxGCPauseMillis = 100 | Production Cluster (gc in msec) | Test Cluster (gc in msec)
      # of gc events | 165192 | 199883
      Avg / Std Dev / Max | 87.1 / 64.9 / 1530 msec | 81.9 / 37.2 / 1370 msec
      50th percentile (median) | 80 msec | 90 msec
      95th / 99th / 99.9th / 99.99th percentile | 210 / 270 / 450 / 660 msec | 120 / 140 / 510 / 780 msec
      Share of pauses within 100 / 200 / 300 / 400 msec | 67% / 95% / 99.4% / 99.8% | 85% / 99.4% / 99.6% / 99.8%
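Statistics like the ones in this table can be derived from a list of GC pause durations. The sketch below assumes the pauses have already been extracted from the GC logs as milliseconds. It uses the nearest-rank percentile definition, which is one of several common definitions and not necessarily the one used for the slide, and the function names are made up for this illustration.

```python
import math


def percentile(sorted_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not sorted_ms:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100.0 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def share_at_or_below(sorted_ms, threshold_ms):
    """Fraction of pauses that finished within threshold_ms."""
    return sum(1 for v in sorted_ms if v <= threshold_ms) / len(sorted_ms)


def summarize(pauses_ms, target_ms=100):
    """Summary of GC pauses against a MaxGCPauseMillis-style target."""
    data = sorted(pauses_ms)
    return {
        "count": len(data),
        "median": percentile(data, 50),
        "p99": percentile(data, 99),
        "max": data[-1],
        "within_target": share_at_or_below(data, target_ms),
    }


if __name__ == "__main__":
    # Made-up sample values, not data from the talk.
    print(summarize([80, 90, 70, 100, 110, 95, 60, 400, 85, 75]))
```

Reporting both directions, the pause at a given percentile and the share of pauses within a given threshold, makes it easier to see how far the tail extends beyond the configured target.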
  • 37. In Conclusion…
  • 38. Adopting an open source product is a journey…
      • Learning from previous adoption successes is crucial. If a use case has not been tried, analyzed, or written about before, chances are we will have to pay for the learning, and having alternate choices is a good idea.
      • Making only one major technology change at a time is always a good idea.
      • Setting appropriate expectations through team members and agile processes is important.
      • Going to a production scenario early, as a shadow, and learning through frequent releases is helpful.
      • We believe extra capacity for peak workloads was very helpful.
      • Having the source code is very useful for learning and trouble-shooting.
  • 39. It Takes a Village! Thank you! Alexandr Peyko Amit Sharma Anthony Chu Arindam Chakraborty Artem Savinov Aviral Agarwal Bala Saravanan Kannan Ben Crane Carl Duque Chetan Talanki Debasis Mullick Deepankar Palit Hong Zhu Igor Karpenko Igor Peller Igor Ulianitski Jay Gardner Jim Gordon Karthikeyan Manickavasagan Liang Gao Murali Reddy Nandakumar Jayakumar Nimish Shah Peter Meigs Pradyot Sikdar Praveen Rudraraju Rajat Raj Raj Merchia Ralph Blore Ranjan Dutta Ricardo De Ocampo Domingo Robert Walsh Sabu Peter Sam Hamilton Sandeep Reddy Satyaban Nandi Soumya Das Srijoy Aditya Srinivas Reddy Surasani Suchismita Nayak Suresh Pulikara Ujjwal Kumar Vikash Talanki Vinay Sarda Waqar Hasan Winnie Chau Xuepeng (Hans) Li Yanyan Hao Yusuf Rahaman Amandeep Khurana Jeongho Park Jugoslav Djajic Justin Hayes Michael Stack