HBase replication

© Cloudera, Inc. All rights reserved.
Wellington Chevreuil

Overview
● Replication Basics
● Requirements
● HBase Shell Commands
● Implementation Details
● Monitoring
● Extra Tools
● Hands-on labs

Replication Basics
● Source-push strategy
● Master, Source, Originator: all refer to the cluster sending data.
● Slave, Destination, Target: all refer to the cluster receiving data.
● Can be cyclic and allows for multiple masters and slaves
○ A master can have multiple slaves
○ A slave can have multiple masters
○ A cluster can act as both master and slave in a given topology
● Eventual consistency
● Asynchronous
● Configurable at column family level
● Relies on WAL data
○ Changes that bypass the WAL are not replicated, such as bulk loads, the truncate command, or writes with skip WAL enabled.
● Tracked via ZooKeeper
● Work done by RegionServers
● Adds the source cluster ID to each edit's metadata
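
The cluster ID carried in each edit's metadata is what keeps a cyclic (master/master) topology from replicating the same edit forever. The following is a minimal Python sketch of that idea, not HBase code: the `Edit` and `Cluster` classes and the `cluster_ids` field are hypothetical names, and shipping is modelled as a direct synchronous call, whereas real replication is asynchronous.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    row: str
    value: str
    # IDs of the clusters this edit has already been applied on (assumed field name).
    cluster_ids: set = field(default_factory=set)


class Cluster:
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.peers = []    # clusters this one replicates to
        self.table = {}    # row -> value
        self.applied = 0   # number of edits applied on this cluster

    def put(self, row, value):
        """Client write: the edit originates on this cluster."""
        self._apply(Edit(row, value))

    def _apply(self, edit):
        self.table[edit.row] = edit.value
        self.applied += 1
        edit.cluster_ids.add(self.cluster_id)
        for peer in self.peers:
            # Skip peers that have already seen this edit; this breaks the cycle.
            if peer.cluster_id not in edit.cluster_ids:
                peer._apply(Edit(edit.row, edit.value, set(edit.cluster_ids)))


a, b = Cluster("A"), Cluster("B")
a.peers.append(b)
b.peers.append(a)  # cyclic topology: A -> B and B -> A
a.put("r1", "v1")
print(b.table, a.applied, b.applied)
```

Each cluster applies the edit exactly once: B receives it from A, sees A's ID in the metadata, and does not send it back.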

Requirements
● RegionServers in each cluster must be able to reach all RegionServers in the other cluster
● The ZooKeeper quorum of each slave must be accessible from its masters
● Table structure must be the same in master and slave clusters
○ The column families targeted for replication must match on master and slave clusters
● If the same ZooKeeper quorum is used for master and slave clusters, zookeeper.znode.parent must be different
● Clusters can have varying sizes
● Clusters can have pre-existing data on target tables
○ In this case, only data added on master after replication has been enabled will be replicated
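
A peer is identified by a cluster key of the form quorum:port:znode_parent (for example the remote_peer_host:2181:/hbase value that appears in the RS logs). The shared-quorum requirement above can be checked with a small Python sketch; `parse_cluster_key` and `keys_conflict` are hypothetical helper names, not part of HBase.

```python
def parse_cluster_key(key):
    """Split 'host1,host2:2181:/hbase' into (set of hosts, port, znode parent)."""
    quorum, port, znode_parent = key.split(":")
    return frozenset(quorum.split(",")), int(port), znode_parent


def keys_conflict(source_key, peer_key):
    """True if both clusters use the same quorum AND the same znode parent."""
    s_hosts, s_port, s_znode = parse_cluster_key(source_key)
    p_hosts, p_port, p_znode = parse_cluster_key(peer_key)
    return s_hosts == p_hosts and s_port == p_port and s_znode == p_znode


# Same quorum, different znode parent: acceptable.
print(keys_conflict("zk1,zk2:2181:/hbase", "zk1,zk2:2181:/hbase-dr"))
# Same quorum (hosts listed in a different order), same znode parent: conflict.
print(keys_conflict("zk1,zk2:2181:/hbase", "zk2,zk1:2181:/hbase"))
```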

HBase Shell Commands
● add_peer
○ Adds a new slave (peer) to the current cluster.
● list_peers
○ Shows current list of slaves "known" by this cluster.
● disable_peer
○ Pauses replication, but keeps tracking new edits to be replicated.
● enable_peer
○ Resumes replication. All edits added since disable_peer was executed will now be sent to the related slaves.
● remove_peer
○ Stops replication to the given slave and removes it from the peer list.
○ No further edits will be sent to that slave.

HBase Shell Commands
● enable_table_replication
○ Sets the replication flag to true on all column families of the specified table.
● disable_table_replication
○ The opposite of the above.
● append_peer_tableCFs, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs, update_peer_config, get_peer_config, list_peer_configs, list_replicated_tables.
○ General admin commands for changing or inspecting the configuration of tables currently targeted for replication.

Implementation Details - Deployment Overview
● This is a deployment diagram in the context of replication only, so only the major components relevant to the replication flow are highlighted.
● Note that HMasters are not involved, either on the master (source) or on the slave (destination) cluster.
● ZooKeeper is of vital importance, as it keeps the registry of edits to be replicated, as well as the peers to replicate to.
● RSes on the master cluster depend on ZK from the slave cluster.

Implementation Details - Setup/Maintenance commands
● Shell commands interact directly with ZooKeeper.
● Replication state is kept in the master cluster's ZooKeeper znodes.
● There is no interaction with RSes when replication shell commands are run.

Implementation Details - Setup WAL and Replication
● RS init phase where the replication service classes are created.
● Once the replication-related classes are properly initialized, the Replication instance is added to the list of WALActionListener.
● A WALFactory instance is created, with the list of listeners containing the Replication instance.

● Replication-related classes are only initialised if "hbase.replication" is set to true.
● Default Replication Source/Sink implementation: org.apache.hadoop.hbase.replication.regionserver.Replication
○ This is configurable via hbase.replication.source.service and hbase.replication.sink.service. Watch out for possible customer-specific configurations.
● Initialisation happens between the following messages in the RS startup logs:
INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=...
INFO org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl: Added new peer cluster=remote_peer_host:2181:/hbase
INFO org.apache.hadoop.hbase.wal.WALFactory: Instantiating WALProvider of type class org.apache.hadoop.hbase.wal.BoundedRegionGroupingProvider

● While the WAL-related classes are being created, the WAL file is rolled.
● Replication was added as a WAL listener earlier, so ReplicationSourceManager is notified about the log roll.
● Using ZooKeeper, ReplicationSourceManager adds the new WAL file to the queue of logs (stored under the replication znodes).

● No replication-specific log message is recorded during WAL file rolling.
● ReplicationSourceManager is notified about the new WAL file between the messages below:
INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: WAL configuration: blocksize=128 MB, ...
...
INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: New WAL /hbase/WALs/…
….

● Potential replication errors in this phase are mostly related to znode access, which prevents the ZK queue from being initialized:
ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
java.io.IOException: Failed replication handler create
at org.apache.hadoop.hbase.replication.regionserver.Replication.initialize(Replication.java:130)
at org.apache.hadoop.hbase.regionserver.HRegionServer.newReplicationInstance(HRegionServer.java:2662)
at org.apache.hadoop.hbase.regionserver.HRegionServer.createNewReplicationInstance(HRegionServer.java:2632)
at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1647)
at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1388)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:918)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.replication.ReplicationException: Could not initialize replication queues.
at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.init(ReplicationQueuesZKImpl.java:85)
at org.apache.hadoop.hbase.replication.regionserver.Replication.initialize(Replication.java:122)
... 6 more

Implementation Details - Start Replication Thread
● In the HRegionServer.startServiceThreads method, the replication source and sink threads are set up and started.
● ReplicationSourceManager initialization involves several steps, detailed next.
● The ReplicationSink instance performs the actual sink work if the cluster acts as a destination cluster. Detailed later.

● Once ReplicationSourceManager.addSource has completed properly for each peer, the "Current list of replicators" message below is logged.
● Upon startup, the ReplicationSource.run method also logs the "Replicating" message below.
○ Since this happens asynchronously, it may appear before or after the previous message.
○ It should be logged for each peer id.
INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Current list of replicators: [host-1,60020,1510938412878, host1,60020,1510929825829] other RSs: [host-1,60020,1510938412878]
…
INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 9fa10771-97b2-48ed-b635-b0bd474a99b2 -> 5f54f936-a5f8-4726-9d09-7bf1c709eeab

Implementation Details - New Peers
● ReplicationTrackerZKImpl receives notifications about changes on the replication znodes.
● Adding a new peer triggers a peer list update on ReplicationPeersZKImpl.
● With at least one peer, ReplicationQueuesZKImpl is notified about WAL file creation.
INFO org.apache.hadoop.hbase.replication.ReplicationTrackerZKImpl: /hbase/replication/peers znode expired, triggering peerListChanged event
...
INFO org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl: Added new peer cluster=peer-host:2181:/hbase

Implementation Details - Shipping Edits
● The main work is done by ReplicationSourceWorkerThread instances.
○ One per WAL group.
○ Every WAL group has its own queue of WAL files to be processed.
○ Runs in the background indefinitely. Sleeps for replication.source.sleepforretries if the peer is disabled.
○ On each loop iteration:
■ Reads the WAL currently being written.
■ Applies WAL entry filters (keeps only edits for CFs marked for replication whose origin cluster ID is not the same as the peer's).
■ Connects to an RS on the remote cluster and sends the filtered edits (via RPC).
■ Edits must be read (and processed) sequentially. If a shipment fails, replication does not progress for that WAL group, and lag may be seen.
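
The loop described above can be sketched in Python as follows. This is an illustration only, not the HBase implementation: `filter_edits`, `ship_queue`, the dict layout of an edit and the `send` callback are all hypothetical, and the real worker thread sleeps and retries a failed shipment instead of returning.

```python
from collections import deque


def filter_edits(entries, replicated_cfs, peer_cluster_id):
    """Keep edits for replicated CFs that did not originate on the peer."""
    return [e for e in entries
            if e["cf"] in replicated_cfs and e["origin"] != peer_cluster_id]


def ship_queue(wal_queue, replicated_cfs, peer_cluster_id, send):
    """Process WAL files in order; stop at the first file that fails to ship.

    Returns the number of WAL files fully processed. Files left in
    `wal_queue` represent replication lag for this WAL group.
    """
    processed = 0
    while wal_queue:
        batch = filter_edits(wal_queue[0], replicated_cfs, peer_cluster_id)
        if batch and not send(batch):
            break              # do not skip ahead: order must be preserved
        wal_queue.popleft()    # only advance once the batch has been shipped
        processed += 1
    return processed


wals = deque([
    [{"cf": "cf1", "origin": "A", "row": "r1"}],
    [{"cf": "cf2", "origin": "A", "row": "r2"}],  # cf2 is not marked for replication
    [{"cf": "cf1", "origin": "B", "row": "r3"}],  # originated on the peer itself
])
sent = []
ship_queue(wals, {"cf1"}, "B", lambda batch: sent.append(batch) is None)
print(sent)
```

Only the first edit reaches the peer; the other two WAL files are consumed without sending anything because their edits are filtered out.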

Implementation Details - Shipping Edits (Source Side)

● HBaseInterClusterReplicationEndpoint.replicate() method detailed flow
● Uses its own thread pool for performing RPC calls
● Replicator class implements java.util.concurrent.Callable for async execution.

● Replicator uses SinkPeer to discover the remote RS responsible for running the sink.
● ReplicationProtbufUtil is used to convert the request to protobuf and perform the RPC.

Implementation Details - Shipping Edits (Destination Side)
● ReplicationSink uses the default client API to process put/delete operations.
● The RS running the sink is not necessarily the one hosting the regions where the entries will be written.
● Coprocessors may get invoked.

Monitoring
● Some classes provide additional TRACE/DEBUG messages that can be turned on for further troubleshooting.
● It is worth enabling these from the RS UI for specific classes only, instead of turning on TRACE for the whole HBase service:
○ ReplicationSource, HBaseReplicationEndpoint, HBaseInterClusterReplicationEndpoint.
● JMX metrics can also help assess the state of replication:
○ shippedBatches, AgeOfLastShippedOP, logReadInBytes.
■ Global and per WAL group id.
● ReplicationStatisticsThread also logs replication stats every 5 minutes:
INFO org.apache.hadoop.hbase.replication.regionserver.Replication: Normal source for cluster 1: Total replicated edits: 2, current progress:
walGroup [host-1%2C60020%2C1511034265841.null0]: currently replicating from:
hdfs://nameservice1/hbase/WALs/host-1,60020,1511034265841/host-1%2C60020%2C1511034265841.null0.1511196279542 at position: 83

Monitoring
● HBase shell status 'replication' command:
○ On source cluster:
1 live servers
Host-10-17-101-41.coe.cloudera.com:
SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=0, TimeStampsOfLastShippedOp=Mon Nov 20 10:02:05 PST 2017, Replication Lag=0
SINK : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Sat Nov 18 11:49:29 PST 2017
○ On destination cluster:
1 live servers
Host-10-17-103-206.coe.cloudera.com:
SOURCE:
SINK : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Mon Nov 20 08:40:19 PST 2017
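
For scripted checks, the SOURCE and SINK lines of this output can be turned into key/value pairs. The Python sketch below assumes the line layout shown above (comma-separated name=value items after the first colon); `parse_metrics` is a hypothetical helper, not an HBase tool.

```python
def parse_metrics(line):
    """Parse a 'SOURCE: k=v, k=v' or 'SINK : k=v, ...' line into a dict of strings."""
    _, _, rest = line.partition(":")  # drop the SOURCE / SINK label
    metrics = {}
    for item in rest.split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore empty items, e.g. a bare 'SOURCE:' line
            metrics[key.strip()] = value.strip()
    return metrics


line = ("SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=0, "
        "TimeStampsOfLastShippedOp=Mon Nov 20 10:02:05 PST 2017, Replication Lag=0")
metrics = parse_metrics(line)
print(metrics["SizeOfLogQueue"], metrics["Replication Lag"])
```

A SizeOfLogQueue or Replication Lag value that keeps growing between runs would be the signal to look at the source RS logs.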

Monitoring
● VerifyReplication
○ MR job that compares the records for the table in source and destination cluster.
○ Prints counters with its findings:
1 test-1
...
17/11/20 10:43:12 INFO mapreduce.Job: map 0% reduce 0%
17/11/20 10:43:24 INFO mapreduce.Job: Job job_1506585949780_0005 completed successfully
…
org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
BADROWS=25
GOODROWS=11
ONLY_IN_SOURCE_TABLE_ROWS=25
...

Monitoring
● DumpReplicationQueues
hbase org.apache.hadoop.hbase.replication.regionserver.DumpReplicationQueues --distributed
...
Dumping replication peers and configurations:
Peer: 2
State: ENABLED
Cluster Name: clusterKey=host-10-17-103-187.coe.cloudera.com,host-10-17-103-189.coe.cloudera.com,host-10-17-103-193.coe.cloudera.com:2181:/hbase,replicationEndpointImpl=null
Peer Table CFs: null
…
Dumping replication queue info for RegionServer: [host-10-17-101-41.coe.cloudera.com,60020,1511971261591]
replication queue: 1
Replication position for host-10-17-101-41.coe.cloudera.com%2C60020%2C1511971261591.null0.1512140473468: 13227
...

Extra Tools
● In case data is already available on either source or destination cluster tables, some tools can be used to sync data:
○ CopyTable
■ https://hbase.apache.org/book.html#copy.table
○ Export Snapshots
■ https://hbase.apache.org/book.html#ops.snapshots.export
○ Bulk Load
■ https://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
○ HashTable/SyncTable
■ Now documented here.
■ Best option, can be used even after replication is already enabled.
■ Allows for syncing deleted rows.
■ Only available from CDH 5.9.0 onwards

Extra Tools
● HashTable/SyncTable:
○ Two MR jobs
■ org.apache.hadoop.hbase.mapreduce.HashTable
■ org.apache.hadoop.hbase.mapreduce.SyncTable
○ Usage:
■ First, run the HashTable MR job on the cluster whose state should be propagated to the remote peer. For example, to sync the state of table "test-1" on the destination cluster with its state on the source cluster, run the command below on the source:
● The first param is the table name, and the second param is an HDFS path where the HashTable job should output the table's summary.
$ hbase org.apache.hadoop.hbase.mapreduce.HashTable test-1 /tmp/test-1

Extra Tools
● HashTable/SyncTable:
○ Usage
■ Once HashTable has finished on the source cluster, run SyncTable on the destination cluster:
■ The --sourcezkcluster option is the ZK address of the source cluster; the first positional param is the HashTable output path on the source cluster's NN.
■ The last two params are the table names on the source and destination clusters.
■ This command brings the table data on the destination cluster in sync with the source cluster:
● If the source cluster had more rows prior to the command, these additional rows are copied to the destination.
● If the destination cluster had more rows than the source, these rows are deleted from the destination.
$ hbase org.apache.hadoop.hbase.mapreduce.SyncTable --sourcezkcluster=source_zk:2181:/hbase hdfs://source_nn:8020/tmp/test-1 test-1 test-1
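
The general hash-then-sync idea behind these two jobs can be illustrated with a small in-memory Python sketch. It is a simplification under stated assumptions, not the MR implementation: tables are plain dicts, `hash_table` and `sync_table` are hypothetical functions, batches are fixed-size rather than size-based, and the source rows are passed in directly instead of being scanned from the remote cluster.

```python
import hashlib


def digest(rows):
    """Hash a list of (row, value) pairs, independent of input order."""
    return hashlib.md5(repr(sorted(rows)).encode()).hexdigest()


def hash_table(table, batch_size):
    """Source side: return [(start_row, digest)] for consecutive row batches."""
    rows = sorted(table.items())
    return [(rows[i][0], digest(rows[i:i + batch_size]))
            for i in range(0, len(rows), batch_size)]


def sync_table(source, hashes, target):
    """Target side: re-hash each key range and fix only the ranges that differ."""
    if not hashes:  # empty source: every target row is extra
        changed = 1 if target else 0
        target.clear()
        return changed
    starts = [start for start, _ in hashes]
    changed = 0
    for idx, (start, src_digest) in enumerate(hashes):
        lo = "" if idx == 0 else start  # first range also covers smaller keys
        hi = starts[idx + 1] if idx + 1 < len(starts) else None
        in_range = lambda k: k >= lo and (hi is None or k < hi)
        tgt_rows = [(k, v) for k, v in target.items() if in_range(k)]
        if digest(tgt_rows) == src_digest:
            continue  # range already in sync, nothing to copy
        src_rows = {k: v for k, v in source.items() if in_range(k)}
        for k, _ in tgt_rows:
            if k not in src_rows:
                del target[k]  # row exists only on the destination
        target.update(src_rows)  # missing or differing rows
        changed += 1
    return changed


source = {"r1": "a", "r2": "b", "r3": "c", "r4": "d"}
target = {"r0": "x", "r1": "a", "r2": "b", "r3": "STALE", "r9": "z"}
ranges_fixed = sync_table(source, hash_table(source, 2), target)
print(ranges_fixed, target == source)
```

Comparing per-range hashes first means only the ranges that actually differ need their rows transferred, which is why this approach suits tables that are already mostly in sync.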

Lab Exercises
1. Problem 1: Replication related znodes not readable by RSes
2. Problem 2: Remote cluster not reachable by source cluster
3. Problem 3: Remote cluster is reachable, but sinks are not completing
