HBaseCon, May 2012

HBase Coprocessors
Lars George | Solutions Architect
Revision History

Version      Revised By                                    Description of Revision
Version 1    Lars George                                   Initial version




Overview

•  Coprocessors were added to Bigtable
  –  Mentioned during LADIS 2009 talk
•  Runs user code within each region of a table
  –  Code splits and moves with the region
•  Defines a high-level call interface for clients
•  Calls are addressed to rows or ranges of rows
•  Implicit automatic scaling, load balancing, and request routing
Example Use Cases

•  Bigtable uses Coprocessors
  –  Scalable metadata management
  –  Distributed language model for machine
     translation
  –  Distributed query processing for full-text index
  –  Regular expression search in code repository
•  MapReduce jobs over HBase are often map-only jobs
  –  Row keys are already sorted and distinct
  ➜ Could be replaced by Coprocessors
HBase Coprocessors
•  Inspired by Google’s Coprocessors
   –  Not much information available, but general idea is
      understood
•  Define various types of server-side code
   extensions
   –  Associated with table using a table property
   –  Attribute is a path to JAR file
   –  JAR is loaded when region is opened
   –  Blends new functionality with existing
•  Can be chained with Priorities and Load Order

➜ Allows for dynamic RPC extensions
Coprocessor Classes and Interfaces

•  The Coprocessor Interface
  –  All user code must implement this interface
•  The CoprocessorEnvironment Interface
  –  Retains state across invocations
  –  Predefined classes
•  The CoprocessorHost Interface
  –  Ties state and user code together
  –  Predefined classes
Coprocessor Priority

•  System or User


/** Highest installation priority */
static final int PRIORITY_HIGHEST = 0;
/** High (system) installation priority */
static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
/** Default installation prio for user coprocessors */
static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
/** Lowest installation priority */
static final int PRIORITY_LOWEST = Integer.MAX_VALUE;
Coprocessor Environment

•  Available Methods
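Roughly, these are the methods available from the environment, shown as a sketch of the 0.92-era interface; exact signatures may vary between HBase versions:

public interface CoprocessorEnvironment {
  /** Version of the environment implementation */
  int getVersion();
  /** Version string of the HBase release that is running */
  String getHBaseVersion();
  /** The loaded coprocessor instance this environment wraps */
  Coprocessor getInstance();
  /** Priority assigned when the coprocessor was loaded */
  int getPriority();
  /** Load order within the same priority */
  int getLoadSequence();
  /** Table access from within the coprocessor */
  HTableInterface getTable(byte[] tableName) throws IOException;
}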
Coprocessor Host

•  Maintains all Coprocessor instances and
   their environments (state)
•  Concrete Classes
  –  MasterCoprocessorHost
  –  RegionCoprocessorHost
  –  WALCoprocessorHost
•  Subclasses provide access to specialized
   Environment implementations
Control Flow
Coprocessor Interface

•  Base for all other types of Coprocessors
•  start() and stop() methods for lifecycle
   management
•  State as defined in the interface:
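The following is a rough 0.92-era sketch (approximate; the exact definition lives in the HBase source):

public interface Coprocessor {
  // lifecycle callbacks, invoked by the hosting framework
  void start(CoprocessorEnvironment env) throws IOException;
  void stop(CoprocessorEnvironment env) throws IOException;

  // lifecycle state, tracked per loaded coprocessor
  enum State {
    UNINSTALLED, INSTALLED, STARTING, ACTIVE, STOPPING, STOPPED
  }
}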
Observer Classes

•  Comparable to database triggers
  –  Callback functions/hooks for every explicit API
     method, but also all important internal calls
•  Concrete Implementations
  –  MasterObserver
     •  Hooks into HMaster API
  –  RegionObserver
     •  Hooks into Region related operations
  –  WALObserver
     •  Hooks into write-ahead log operations
Region Observers

•  Can mediate (veto) actions (see the sketch below)
  –  Used by the security policy extensions
  –  Priority allows mediators to run first
•  Hooks into all CRUD+S API calls and more
  –  get(), put(), delete(), scan(), increment(),…
  –  checkAndPut(), checkAndDelete(),…
  –  flush(), compact(), split(),…
•  Pre/Post Hooks for every call
•  Can be used to build secondary indexes,
   filters
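As an illustration of the mediation idea, here is a minimal, hypothetical observer that vetoes every client delete; the hook signature follows the 0.92-era API, and the class name is made up for this sketch:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Hypothetical example: reject every client-issued delete
public class DenyDeleteObserver extends BaseRegionObserver {

  @Override
  public void preDelete(ObserverContext<RegionCoprocessorEnvironment> e,
      Delete delete, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // throwing from the pre-hook aborts the client operation
    throw new IOException("deletes are not allowed on this table");
  }
}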
Endpoint Classes

•  Define a dynamic RPC protocol, used
   between client and region server
•  Executes arbitrary code, loaded in region
   server
  –  Future development will add code weaving/inspection to deny malicious code
•  Steps to add your own methods
  –  Define and implement your own protocol
  –  Implement endpoint coprocessor
  –  Call HTable’s coprocessorExec() or
     coprocessorProxy()
Coprocessor Loading

•  There are two ways: dynamic or static
  –  Static: use configuration files and table schema
  –  Dynamic: not available (yet)
•  For static loading from configuration:
  –  Order is important (defines the execution order)
  –  Special property key for each host type
  –  Region related classes are loaded for all regions
     and tables
  –  Priority is always System
  –  JAR must be on class path
Loading from Configuration

•  Example:
  <property>
    <name>hbase.coprocessor.region.classes</name>
    <value>coprocessor.RegionObserverExample,
      coprocessor.AnotherCoprocessor</value>
  </property>

  <property>
    <name>hbase.coprocessor.master.classes</name>
    <value>coprocessor.MasterObserverExample</value>
  </property>

  <property>
    <name>hbase.coprocessor.wal.classes</name>
    <value>coprocessor.WALObserverExample,
      bar.foo.MyWALObserver</value>
  </property>
Coprocessor Loading (cont.)

•  For static loading from table schema:
  –  Definition per table
  –  For all regions of the table
  –  Only region related classes, not WAL or Master
  –  Added to HTableDescriptor when the table is created or altered
  –  Allows setting the priority and JAR path
  COPROCESSOR$<num> ➜
      <path-to-jar>|<classname>|<priority>
Loading from Table Schema

•  Example:

'COPROCESSOR$1' =>
  'hdfs://localhost:8020/users/leon/test.jar|
   coprocessor.Test|10'

'COPROCESSOR$2' =>
  '/Users/laura/test2.jar|
   coprocessor.AnotherTest|1000'
Example: Add Coprocessor
public static void main(String[] args) throws IOException {
  Configuration conf = HBaseConfiguration.create();
  FileSystem fs = FileSystem.get(conf);

  Path path = new Path(fs.getUri() + Path.SEPARATOR +
    "test.jar");
  HTableDescriptor htd = new HTableDescriptor("testtable");
  htd.addFamily(new HColumnDescriptor("colfam1"));
  htd.setValue("COPROCESSOR$1", path.toString() +
    "|" + RegionObserverExample.class.getCanonicalName() +
    "|" + Coprocessor.PRIORITY_USER);
  HBaseAdmin admin = new HBaseAdmin(conf);
  admin.createTable(htd);
  System.out.println(admin.getTableDescriptor(
    Bytes.toBytes("testtable")));
}
Example Output
{NAME => 'testtable', COPROCESSOR$1 =>
 'file:/test.jar|coprocessor.RegionObserverExample|1073741823',
 FAMILIES => [{NAME => 'colfam1', BLOOMFILTER => 'NONE',
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
Region Observers

•  Handles all region related events
•  Hooks for two classes of operations:
  –  Lifecycle changes
  –  Client API Calls
•  All client API calls have a pre/post hook
  –  Can be used to grant access on preGet()
  –  Can be used to update secondary indexes on postPut(), as sketched below
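A minimal sketch of the secondary-index idea, assuming a pre-created index table named "index"; the postPut() signature follows the 0.92-era API, and all class and table names here are hypothetical:

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical example: mirror every put into an "index" table,
// keyed by cell value instead of row key
public class IndexingObserver extends BaseRegionObserver {

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    HTableInterface index =
      e.getEnvironment().getTable(Bytes.toBytes("index"));
    try {
      for (Map.Entry<byte[], List<KeyValue>> family :
           put.getFamilyMap().entrySet()) {
        for (KeyValue kv : family.getValue()) {
          // index row key = cell value, cell value = original row key
          Put indexPut = new Put(kv.getValue());
          indexPut.add(family.getKey(), kv.getQualifier(), put.getRow());
          index.put(indexPut);
        }
      }
    } finally {
      index.close();
    }
  }
}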
Handling Region Lifecycle Events




•  Hook into pending open, open, and pending
   close state changes
•  Called implicitly by the framework
  –  preOpen(), postOpen(),…
•  Used to piggyback or fail the process, e.g.
  –  Cache warm up after a region opens
  –  Suppress region splitting, compactions, flushes
Region Environment
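The region-level environment adds accessors roughly like these (0.92-era names, approximate):

public interface RegionCoprocessorEnvironment
    extends CoprocessorEnvironment {
  /** The region this coprocessor instance is attached to */
  HRegion getRegion();
  /** Access to region server level services */
  RegionServerServices getRegionServerServices();
}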
Special Hook Parameter
public interface RegionObserver extends Coprocessor {

  /**
   * Called before the region is reported as open to the master.
   * @param c the environment provided by the region server
   */
  void preOpen(final
    ObserverContext<RegionCoprocessorEnvironment> c);

  /**
   * Called after the region is reported as open to the master.
   * @param c the environment provided by the region server
   */
  void postOpen(final
    ObserverContext<RegionCoprocessorEnvironment> c);

  // ... many more hooks follow
ObserverContext
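Every hook receives an ObserverContext as its first parameter; its key methods are summarized here as an approximate sketch of the 0.92-era class (the name ObserverContextSketch is only used to keep this summary compilable):

public interface ObserverContextSketch<E extends CoprocessorEnvironment> {
  E getEnvironment();        // environment of the coprocessor being called
  void bypass();             // skip the framework's default action (pre-hooks)
  void complete();           // skip the remaining coprocessors in the chain
  boolean shouldBypass();    // checked and reset by the coprocessor host
  boolean shouldComplete();  // checked and reset by the coprocessor host
}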
Chain of Command

•  The complete() and bypass() methods in particular allow changing the processing chain
  –  complete() ends the chain at the current coprocessor
  –  bypass() completes the pre/post chain but uses the last value returned by the coprocessors, possibly not calling the actual API method (for pre-hooks)
Example: Pre-Hook Complete



@Override
public void preSplit(
    ObserverContext<RegionCoprocessorEnvironment> e) {
  e.complete();
}
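A complementary sketch for bypass(): a pre-hook that serves a canned answer for one special row and skips the actual get. The preGet() signature follows the 0.92-era API; the row, family, and class names are made up for this example:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.util.Bytes;

public class FixedAnswerObserver extends BaseRegionObserver {

  private static final byte[] SPECIAL_ROW = Bytes.toBytes("@@@STATUS@@@");

  @Override
  public void preGet(ObserverContext<RegionCoprocessorEnvironment> e,
      Get get, List<KeyValue> results) throws IOException {
    if (Bytes.equals(get.getRow(), SPECIAL_ROW)) {
      // fill in the result ourselves ...
      results.add(new KeyValue(get.getRow(), Bytes.toBytes("colfam1"),
        Bytes.toBytes("status"), Bytes.toBytes("ok")));
      // ... and tell the framework to skip the actual region get()
      e.bypass();
    }
  }
}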
Master Observer

•  Handles all HMaster related events
  –  DDL type calls, e.g. create table, add column
  –  Region management calls, e.g. move, assign
•  Pre/post hooks with Context
•  Specialized environment provided
Master Environment
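In essence, the master-level environment adds a single accessor (0.92-era, approximate):

public interface MasterCoprocessorEnvironment
    extends CoprocessorEnvironment {
  /** Entry point into the master internals, see the next slides */
  MasterServices getMasterServices();
}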
Master Services (cont.)

•  Very powerful features
  –  Access the AssignmentManager to modify
     plans
  –  Access the MasterFileSystem to create or
     access resources on HDFS
  –  Access the ServerManager to get the list of
     known servers
  –  Use the ExecutorService to run system-wide
     background processes
•  Be careful (for now)!
Example: Master Post Hook
public class MasterObserverExample
    extends BaseMasterObserver {

  @Override
  public void postCreateTable(
      ObserverContext<MasterCoprocessorEnvironment> env,
      HRegionInfo[] regions, boolean sync)
      throws IOException {
    String tableName =
      regions[0].getTableDesc().getNameAsString();
    MasterServices services =
      env.getEnvironment().getMasterServices();
    MasterFileSystem masterFileSystem =
      services.getMasterFileSystem();
    FileSystem fileSystem = masterFileSystem.getFileSystem();
    Path blobPath = new Path(tableName + "-blobs");
    fileSystem.mkdirs(blobPath);
  }
}
Example Output

hbase(main):001:0> create 'testtable', 'colfam1'
0 row(s) in 0.4300 seconds

$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - larsgeorge supergroup  0 ... /user/larsgeorge/testtable-blobs
Endpoints

•  Dynamic RPC extends server-side
   functionality
  –  Useful for MapReduce-like implementations
  –  The Map part is handled server-side; the Reduce part is done client-side
•  Based on CoprocessorProtocol interface
•  Routing to regions is based on either single
   row keys, or row key ranges
  –  The call is sent whether or not the row exists, since region start and end keys are coarse-grained
Custom Endpoint Implementation

•  Involves two steps:
  –  Extend the CoprocessorProtocol interface
     •  Defines the actual protocol
  –  Extend the BaseEndpointCoprocessor
     •  Provides the server-side code and the dynamic
        RPC method
Example: Row Count Protocol

public interface RowCountProtocol
    extends CoprocessorProtocol {
  long getRowCount()
    throws IOException;
  long getRowCount(Filter filter)
    throws IOException;
  long getKeyValueCount()
    throws IOException;
}
Example: Endpoint for Row Count
public class RowCountEndpoint
    extends BaseEndpointCoprocessor
    implements RowCountProtocol {

  private long getCount(Filter filter,
      boolean countKeyValues) throws IOException {

    Scan scan = new Scan();
    scan.setMaxVersions(1);
    if (filter != null) {
      scan.setFilter(filter);
    }
Example: Endpoint for Row Count
    RegionCoprocessorEnvironment environment =
      (RegionCoprocessorEnvironment) getEnvironment();
    // use an internal scanner to perform the scan
    InternalScanner scanner =
      environment.getRegion().getScanner(scan);
    int result = 0;
Example: Endpoint for Row Count
    try {
      List<KeyValue> curVals =
        new ArrayList<KeyValue>();
      boolean done = false;
      do {
        curVals.clear();
        done = scanner.next(curVals);
        result += countKeyValues ? curVals.size() : 1;
      } while (done);
    } finally {
      scanner.close();
    }
    return result;
  }
Example: Endpoint for Row Count
  @Override
  public long getRowCount() throws IOException {
    return getRowCount(new FirstKeyOnlyFilter());
  }

  @Override
  public long getRowCount(Filter filter) throws IOException {
    return getCount(filter, false);
  }

  @Override
  public long getKeyValueCount() throws IOException {
    return getCount(null, true);
  }
}
Endpoint Invocation

•  There are two ways to invoke the call
  –  By Proxy, using HTable.coprocessorProxy()
     •  Uses a delayed model, i.e. the call is sent when the proxied method is invoked
  –  By Exec, using HTable.coprocessorExec()
     •  The call is sent in parallel to all regions and the results are collected immediately
•  The Batch.Call class is used by coprocessorExec() to wrap the calls per region
•  The optional Batch.Callback can be used to react upon completion of the remote call (see the sketch below)
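A sketch of the Batch.Callback variant, reusing the RowCountProtocol from the earlier slides; the callback fires once per region as its result arrives, and the callback form of coprocessorExec() returns no map (assumed 0.92-era signatures):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.util.Bytes;

public class CallbackExample {
  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    final AtomicLong total = new AtomicLong();

    table.coprocessorExec(RowCountProtocol.class, null, null,
      new Batch.Call<RowCountProtocol, Long>() {
        @Override
        public Long call(RowCountProtocol counter) throws IOException {
          return counter.getRowCount();
        }
      },
      new Batch.Callback<Long>() {
        @Override
        public void update(byte[] region, byte[] row, Long result) {
          // called once per region, as soon as its result is in
          System.out.println("Region: " + Bytes.toString(region) +
            ", Count: " + result);
          total.addAndGet(result);
        }
      });
    System.out.println("Total Count: " + total.get());
  }
}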
Exec vs. Proxy
Example: Invocation by Exec

public static void main(String[] args) throws IOException {
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "testtable");
  try {
    Map<byte[], Long> results =
      table.coprocessorExec(RowCountProtocol.class, null, null,
        new Batch.Call<RowCountProtocol, Long>() {
          @Override
          public Long call(RowCountProtocol counter)
              throws IOException {
            return counter.getRowCount();
          }
        });
Example: Invocation by Exec
    long total = 0;
    for (Map.Entry<byte[], Long> entry :
         results.entrySet()) {
      total += entry.getValue().longValue();
      System.out.println("Region: " +
        Bytes.toString(entry.getKey()) +
        ", Count: " + entry.getValue());
    }
    System.out.println("Total Count: " + total);
  } catch (Throwable throwable) {
    throwable.printStackTrace();
  }
}
Example Output

Region: testtable,,1303417572005.51f9e2251c...cbcb0c66858f., Count: 2
Region: testtable,row3,1303417572005.7f3df4dcba...dbc99fce5d87., Count: 3
Total Count: 5
Batch Convenience

•  The Batch.forMethod() helps to quickly
   map a protocol function into a Batch.Call
•  Useful for single method calls to the
   servers
•  Uses the Java reflection API to retrieve the
   named method
•  Saves you from implementing the
   anonymous inline class
Batch Convenience

Batch.Call call =
  Batch.forMethod(
    RowCountProtocol.class,
    "getKeyValueCount");
Map<byte[], Long> results =
  table.coprocessorExec(
    RowCountProtocol.class,
    null, null, call);
Call Multiple Endpoints

•  Sometimes you need to call more than
   one endpoint in a single roundtrip call to
   the servers
•  This requires an anonymous inline class,
   since Batch.forMethod cannot handle this
Call Multiple Endpoints

Map<byte[], Pair<Long, Long>> results =
  table.coprocessorExec(
    RowCountProtocol.class, null, null,
    new Batch.Call<RowCountProtocol,
        Pair<Long, Long>>() {
      public Pair<Long, Long> call(
          RowCountProtocol counter)
          throws IOException {
        return new Pair<Long, Long>(
          counter.getRowCount(),
          counter.getKeyValueCount());
      }
    });
Example: Invocation by Proxy


RowCountProtocol protocol =
  table.coprocessorProxy(
    RowCountProtocol.class,
    Bytes.toBytes("row4"));
long rowsInRegion =
  protocol.getRowCount();
System.out.println(
  "Region Row Count: " +
  rowsInRegion);
