HBase tales from the trenches
© Cloudera, Inc. All rights reserved.
Wellington Chevreuil
Agenda
• Common types of problems
• Most affected features
• Common reasons
• Case stories review
• General best practices
Common types of problems
• RegionServers/Master crashes
• Master timing out during initialisation
• Performance
• Client operation slowness
• Master slow initialisation
• Regions stuck in transition (RIT)
• Data unavailable
• Errors can surface in client applications
• Can evolve into a full HBase service outage
• FileSystem usage
• HBase exhausting HDFS space
• Corruption / Data loss / Replication data consistency
Most affected features or sub-systems
• AssignmentManager
• RIT
• Performance (Master initialisation issues)
• Replication
• Space usage exhaustion
• Replication consistency
• Snapshot
• Space usage exhaustion
• Memstore, RPC sub-system, Compaction/Region Splits
• Performance
• WAL/StoreFile codecs
• Data loss, corruption
Common reasons
• Performance
• Overloaded or under-dimensioned cluster
• Too many regions per RS
• Small RS heaps
• Non-optimal configurations:
• May require GC and other JVM config tunings
• HBase-specific adjustments, such as flush and cache sizes, compaction frequency, region limits, handler count, etc.
• Crashes
• Memory exhaustion
• File system issues
• Known bugs
• RIT
• Bugs
• Self-induced (wrong hbck commands triggered)
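The HBase-specific adjustments mentioned above typically live in hbase-site.xml. A minimal illustrative fragment follows; the property names are standard, but the values are examples only and must be tuned to the workload:

```xml
<!-- hbase-site.xml: illustrative values only; tune per workload -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value> <!-- RPC handler threads per RegionServer -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value> <!-- 128 MB memstore flush threshold -->
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value> <!-- 10 GB region split threshold -->
</property>
```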
Common reasons
• RIT
• Can also happen as a side effect of performance, crash, or corruption issues
• File system usage exhaustion
• Replication related issues
• Too many snapshots
• Corruption / Data loss / Replication data consistency
• Bugs
• Faulty peers / custom or third-party components
• FileSystem problems
Case story - RegionServers slow/crashing randomly
Type: Process crash | Service outage | Performance
Feature: RegionServer core resource management
Reason: Long GC pauses, due to heap sizes mismatched to the workload
Diagnosing:
• Frequent JvmPauseMonitor alerts in RS logs;
• Occasional OOME on stdout;
• Too many regions per RS (more than 200);
• JVM heap usage charts show wide heap-usage swings (JVisualVM/JConsole)
Resolution:
• Initially, increase the heap size; note that CMS may experience slowness with large heaps.
• For heaps larger than 20 GB, the general G1 recommendations from the Cloudera engineering blog post have provided good results.
• Scale horizontally by adding more RSes.
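As a sketch, those G1 recommendations are applied via HBASE_OPTS in hbase-env.sh. The flag values below are illustrative starting points, not universal recommendations:

```shell
# hbase-env.sh - illustrative G1 starting point for large (>20 GB) RS heaps
export HBASE_OPTS="$HBASE_OPTS \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=100 \
  -XX:+ParallelRefProcEnabled \
  -XX:InitiatingHeapOccupancyPercent=65 \
  -XX:G1HeapRegionSize=32m"
```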
Case story - Slow scans, compactions delayed
Type: Performance.
Feature: Internal scanning.
Reason: PrefixTree HFile encoding issues (HBASE-17375).
Diagnosing:
• Compaction queue piling up;
• jstacks from RSes show the trace below across several frames:
"regionserver/hadoop30-r5.phx.impactradius.net/10.16.20.138:60020-longCompactions-1550194360449" #111 prio=5 os_prio=0
tid=0x00007fcb429b3800 nid=0x243a8 runnable [0x00007fc3481c7000]
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:214)
at org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.next(PrefixTreeSeeker.java:127)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1278)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:628)
Resolution: Disable the table, manually compact it with CompactionTool, disable PrefixTree encoding.
Mitigation: Disable PrefixTree encoding (no longer supported from 2.0 onwards).
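A hedged sketch of that remediation, assuming a table named 'my_table' with a family 'cf' (both are placeholders):

```shell
# In hbase shell: disable the table and switch off PrefixTree encoding
echo "disable 'my_table'
alter 'my_table', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE'}
enable 'my_table'" | hbase shell -n

# Offline major compaction of the table's data with CompactionTool
hbase org.apache.hadoop.hbase.regionserver.CompactionTool \
  -compactOnce -major hdfs:///hbase/data/default/my_table
```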
Case story - My HBase is slow
Type: Performance
Feature: Client Read/Write operations
Reason:
• Dependency services underperforming:
• ZooKeeper fsync issues
• HDFS read slowness
• RPC encryption overhead
• Poor client implementations not reusing connections.
• Faulty CPs or custom filters
Diagnosing:
• Client application/RSes jstacks
• General HBase stats such as: compaction queue size, data locality, cache hit ratio;
• HDFS/ZK logs
Resolution: Usually requires tuning of dependency services or a redesign of the client application/custom CPs/Filters
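Capturing the RS-side evidence above usually means a few spaced jstack samples of the RegionServer JVM. A sketch, assuming JDK tools on the RS host; the PID-discovery pattern is deployment-specific:

```shell
# Take 5 jstack samples, 10s apart, of the local RegionServer JVM
RS_PID=$(pgrep -f proc_regionserver | head -1)  # adjust pattern to your deployment
for i in 1 2 3 4 5; do
  jstack "$RS_PID" > "rs-jstack-$(date +%s).txt"
  sleep 10
done
```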
Case story - Client scans failing | HBCK reports inconsistencies
Type: RIT
Feature: AssignmentManager
Reason: Various
• Misuse of hbck (branch-1) can break hlinks;
• Snapshot cold backups taken outside of HDFS;
• Busy/overloaded clusters where regions keep moving constantly
Diagnosing:
• Evident from hbck reports/Master Web UI.
• Master logs show which RS is opening/hosting the region.
• The RS holding the region should log relevant error messages
Resolution:
• There's no single recipe.
• Each case may require a combination of hbck/hbck2 commands.
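For illustration only, since the exact commands depend on the inconsistency and the HBase version: on branch-1, hbck can both report and (carefully) repair; on 2.x, HBCK2 is a separate jar. The jar path and region name below are placeholders:

```shell
# HBase 1.x: report inconsistencies (read-only)
hbase hbck -details

# HBase 2.x: use HBCK2 to re-assign a region stuck in transition
hbase hbck -j /path/to/hbase-hbck2.jar assigns <encoded-region-name>
```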
Case story - Master timing out during initialisation
Type: Master Crash | Service Outage
Feature: Procedures Framework
Reason: Different bugs can cause procedures to pile up:
• HBASE-22263, HBASE-16488, HBASE-18109
Diagnosing:
• Listing "/hbase/MasterProcWALs" shows hundreds or more files.
• Master times out and crashes before assigning namespace region.
Resolution:
• Stop the Master and clean the "/hbase/MasterProcWALs" folder
• Caution, especially on HBase 2.x and later releases
Mitigation: Increase the init timeout and the number of open-region threads.
ERROR org.apache.hadoop.hbase.master.HMaster: Master failed to complete initialization after
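A sketch of the cleanup described above, sidelining rather than deleting so the procedure WALs can be restored if needed (paths assume the default layout):

```shell
# With the Master stopped, sideline the procedure WALs instead of deleting them
hdfs dfs -mv /hbase/MasterProcWALs /hbase/MasterProcWALs.sidelined
hdfs dfs -mkdir /hbase/MasterProcWALs
# ...then restart the Master
```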
Case story - Replication lags
Type: Replication stuck
Feature: Replication data consistency | HBase exhausting hdfs usage
Reason: Single WAL entries with too many OPs, leading to RPCs larger than
"hbase.ipc.server.max.callqueue.size"
Diagnosing: Destination peer RSes show log messages of the type below
Resolution: Requires WAL copy and replay from source to destination, plus manual znode cleanup
Mitigation: Releases including HBASE-18027 prevent this situation
2018-09-07 10:40:59,506 WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=MY_TABLE, attempt=4/4 failed=2ops, last exception:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is
hbase.ipc.server.max.callqueue.size too small? on region-server-1.example.com,,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not
retrying 2 - final failure
2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
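The copy-and-replay step can be sketched with WALPlayer, which replays WAL edits into a live table. This is a sketch: WAL paths, hostnames, and table names are placeholders, and the stuck replication znode still needs manual cleanup afterwards:

```shell
# Copy the affected WALs to the destination cluster, then replay them
hadoop distcp hdfs://source-nn/hbase/oldWALs/<wal-file> hdfs://dest-nn/tmp/wals/
hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /tmp/wals MY_TABLE
```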
Case story - Client scan failing on specific regions
Type: HFile Corruption
Feature: Snappy compression
Reason: Unknown
Diagnosing: The following errors (either of the two traces below) appear when scanning specific regions
Resolution: Requires sidelining the affected files and re-ingesting the row keys stored in them. Potential data loss.
java.lang.InternalError: Could not decompress data. Input is invalid.
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)
Caused by: java.lang.NegativeArraySizeException
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1718)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542)
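Sidelining a corrupt HFile can be sketched as below; the table, region, and file names are placeholders. The HFile tool helps confirm which file fails to decode:

```shell
# Verify the suspect file: -m prints metadata, -k checks row ordering
hbase org.apache.hadoop.hbase.io.hfile.HFile -m -k -f \
  hdfs:///hbase/data/default/MY_TABLE/<region>/<cf>/<hfile>

# Sideline it out of the table directory (region must be offline/closed)
hdfs dfs -mv hdfs:///hbase/data/default/MY_TABLE/<region>/<cf>/<hfile> /hbase-sideline/
```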
Case story - HBase is eating HDFS space
Type: FileSystem usage
Feature: Replication | Snapshot | Compaction | Cleaners
Reason: Various
• Replication: Slow, faulty or disabled peer, missing tables on remote peer.
• Snapshot: Too many snapshots being retained.
• Compaction and Cleaners threads stuck or not running.
Diagnosing:
• Check usage for "archive" and "oldWALs".
• Master logs would show if cleaner threads are running.
• Is replication stuck or lagging?
• How about snapshot retention policy?
Resolution:
• If cleaner threads are not running, restart the Master.
• For disabled peers, either re-enable them, or remove them if replication is no longer wanted.
• Too many snapshots would require some cleanup or a cold backup.
• Reasons for replication lag may vary; source RS logs should show errors from the replication source threads.
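Checking the usual suspects above is straightforward with du (a sketch, assuming the default /hbase root):

```shell
# Compare space consumed by archived HFiles and old WALs against live data
hdfs dfs -du -s -h /hbase/archive /hbase/oldWALs /hbase/data

# Count retained snapshots
hdfs dfs -ls /hbase/.hbase-snapshot | wc -l
```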
General best practices
• Heap usage monitoring
• Keep regions per RS in the low hundreds
• Consider G1 GC for heaps > 20 GB
• Data locality
• Adjust caching according to workload
• Compaction Policy (Consider offline compactions using CompactionTool)
• Consider an exclusive Zookeeper for HBase
• Adjust Master initialization timeout accordingly
• Consider increasing the number of "open region" handlers
• Define a reasonable snapshot retention policy
• Caution with experimental/non-stable features (Snappy/PrefixTree)
• Define a deployment policy for custom applications/CPs/Filters
• Define a schedule for patch/bug-fix upgrades
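The "low hundreds of regions per RS" guideline follows from memstore math: only so many actively written regions can flush healthily out of one heap. A back-of-envelope sketch, with illustrative values for the named properties:

```shell
heap_bytes=$(( 32 * 1024 * 1024 * 1024 ))  # 32 GB RegionServer heap
memstore_pct=40                            # hbase.regionserver.global.memstore.size (0.4)
flush_size=$(( 128 * 1024 * 1024 ))        # hbase.hregion.memstore.flush.size
active_cfs=1                               # column families written concurrently

# Regions that can each hold a full memstore without exceeding the global limit
max_regions=$(( heap_bytes * memstore_pct / 100 / (flush_size * active_cfs) ))
echo "$max_regions"   # -> 102
```

With these numbers the bound lands at roughly 100 actively written regions, which is where the rule of thumb comes from.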
Q&A
