© 2018 Bloomberg Finance L.P. All rights reserved.
Coprocessors:
Uses, Abuses & Solutions
DataWorks Summit
June 20, 2018
Esther Kundin & Amit Anand
Senior Software Developers
About the Speakers
• Esther Kundin
• Senior Software Developer
• Lead architect and engineer
• Machine Learning and Text Analysis
• Open Source contributor
• Amit Anand
• Senior Software Developer
• Hadoop Services / Infrastructure team
• Deployments/Tooling
• Open Source contributor
Outline
• HBase Architecture Review
• Introduction to Coprocessors
• Coprocessor Uses
• Development Abuses and Solutions
• Deployment Abuses and Solutions
• Takeaways
• Q&A
HBase Architecture Review
HBase Concepts
• HBase is the Hadoop database, a distributed, scalable, big data store
• Keys (RowKeys)
• Regions
• Region Servers
Region 1 – Keys 1, 2
Region 2 – Keys 3, 4
Region 3 – Keys 5, 6
Region 4 – Keys 7, 8
Region Server 1, Region Server 2
HBase processes
• HMaster – table DDL operations
• Region Servers – data read/write operations
• Region Server with .META. table – serves the region that maps regions to Region Servers
• Outside Systems:
— Zookeeper – highly reliable distributed coordination
HBase Read Overview
Components: Client, Zookeeper, Region Server serving the .META. region, Region Servers, HDFS
• Client contacts ZK for the address of the Region Server serving the .META. region
• Client determines which Region Server to contact and caches that data
• Client requests data from the Region Server, which gets the data from HDFS on an in-memory miss
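The lookup-and-cache step of the read path can be sketched with a toy model. This is a minimal sketch, not the HBase client API: the class RegionLocationCache, its methods, and the in-memory stand-in for the .META. table are hypothetical names chosen for illustration, and the Zookeeper step is left out.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a client-side region location cache (hypothetical, not the HBase client). */
final class RegionLocationCache {
    /** One region: row keys in [startKey, endKey) served by one Region Server. */
    static final class Region {
        final String startKey;
        final String endKey; // empty string means "no upper bound"
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }

        boolean contains(String rowKey) {
            return rowKey.compareTo(startKey) >= 0
                    && (endKey.isEmpty() || rowKey.compareTo(endKey) < 0);
        }
    }

    // Stand-in for the .META. table; a real client reads it over RPC.
    private final TreeMap<String, Region> meta = new TreeMap<>();
    // Locations the client has already resolved, keyed by region start key.
    private final TreeMap<String, Region> cache = new TreeMap<>();
    private int metaLookups = 0;

    void addRegion(String startKey, String endKey, String server) {
        meta.put(startKey, new Region(startKey, endKey, server));
    }

    /** Returns the server for rowKey, consulting the meta stand-in only on a cache miss. */
    String locate(String rowKey) {
        Map.Entry<String, Region> hit = cache.floorEntry(rowKey);
        if (hit != null && hit.getValue().contains(rowKey)) {
            return hit.getValue().server;
        }
        metaLookups++;
        Map.Entry<String, Region> found = meta.floorEntry(rowKey);
        if (found == null || !found.getValue().contains(rowKey)) {
            throw new IllegalArgumentException("No region for key " + rowKey);
        }
        cache.put(found.getKey(), found.getValue());
        return found.getValue().server;
    }

    int metaLookups() {
        return metaLookups;
    }
}
```

Repeated reads of keys in the same region are answered from the cache, which is why only the first request for a region pays for the extra lookup.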
HBase Write Overview
Components: Region Server, WAL (on HDFS), MemStore, HFiles
• Region Server persists the write at the end of the WAL
• Region Server saves the write in a sorted map in memory in the MemStore
• When the MemStore reaches a configurable size, it is flushed to an HFile
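The three write steps can be sketched as a toy in-memory model. This is a minimal sketch under stated assumptions: ToyRegionStore is a hypothetical class, the flush threshold counts entries rather than bytes, and the "WAL" and "HFiles" are plain Java collections rather than files on HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path: WAL append, MemStore insert, flush to an "HFile". */
final class ToyRegionStore {
    private final List<String> wal = new ArrayList<>();                      // append-only log
    private TreeMap<String, String> memStore = new TreeMap<>();              // sorted in-memory map
    private final List<TreeMap<String, String>> hFiles = new ArrayList<>();  // flushed, sorted files
    private final int flushThreshold;                                        // entries, not bytes (assumption)

    ToyRegionStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);   // 1. persist the write at the end of the WAL
        memStore.put(rowKey, value);     // 2. save the write in the sorted MemStore
        if (memStore.size() >= flushThreshold) {
            hFiles.add(memStore);        // 3. flush the full MemStore as a new "HFile"
            memStore = new TreeMap<>();
        }
    }

    /** Reads check the MemStore first, then the flushed files from newest to oldest. */
    String get(String rowKey) {
        String value = memStore.get(rowKey);
        if (value != null) {
            return value;
        }
        for (int i = hFiles.size() - 1; i >= 0; i--) {
            value = hFiles.get(i).get(rowKey);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int walSize() { return wal.size(); }
    int memStoreSize() { return memStore.size(); }
    int hFileCount() { return hFiles.size(); }
}
```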
Introduction to Coprocessors
What is a coprocessor?
• Custom code written by users, run as part of the HBase system
• Bundled Jar
• Loaded in HBase daemon JVM
• HMaster
• Region Server
Region Server – HBase code
Coprocessor Jar – application code
Endpoint Coprocessor
• Similar to a stored procedure – triggered by a client API call
• Computation at data location
• Average
• Summation
• Use Protobuf to specify the input/output structure of your service
• Starting with HBase
• Must be invoked
• Table.coprocessorService()
• Implementation of service in Java
• Run the coprocessor by calling that service to do the computation on the collocated data
Region Server – HBase code
Coprocessor Jar – application code
Client
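The idea of computing at the data location can be sketched without any HBase classes. This is a minimal sketch, assuming each region returns only a small partial result (sum and count) and the client merges them; AverageEndpointSketch, Partial, computePartial, and average are hypothetical names, and a real endpoint would define these messages in Protobuf and be invoked through Table.coprocessorService().

```java
import java.util.List;

/** Toy model of an "average" endpoint: per-region partial results merged on the client. */
final class AverageEndpointSketch {
    /** Small fixed-size result sent back by one region instead of all of its rows. */
    static final class Partial {
        final long sum;
        final long count;

        Partial(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Conceptually runs on the Region Server, next to the region's data. */
    static Partial computePartial(long[] regionValues) {
        long sum = 0;
        for (long value : regionValues) {
            sum += value;
        }
        return new Partial(sum, regionValues.length);
    }

    /** Conceptually runs on the client: merges one Partial per region. */
    static double average(List<Partial> partials) {
        long sum = 0;
        long count = 0;
        for (Partial partial : partials) {
            sum += partial.sum;
            count += partial.count;
        }
        if (count == 0) {
            throw new IllegalArgumentException("no values to average");
        }
        return (double) sum / count;
    }
}
```

Only two numbers per region cross the network, regardless of how many rows each region holds.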
Observer Coprocessor
• Similar to a trigger – fires on an action happening in the daemon
• Hooks available for all HBase event types
• Canonical example - Can implement a security check on preGetOp/prePut
• Referential integrity – constraint on synthetic foreign keys
• Secondary indices – that’s how Phoenix does it
Region Server – HBase code
Coprocessor Jar – application code
Client HDFS
Data Flow
Filters
• Similar to Observer Coprocessors for get/scan
• Can filter on any part of a row
• Many built-in filters available
• Can deploy custom filters
• Support for push-down predicates
• Different API
• Came earlier and are much more limited
Filter Example
Snippet courtesy of https://github.com/larsgeorge/hbase-book/blob/master/ch04/src/main/java/filters/CustomFilterExample.java
Observer Coprocessor – Possible Triggers
• Region Server Observer
• preStopRegionServer
• preExecuteProcedures
• postExecuteProcedures
• preClearCompactionQueues
• postClearCompactionQueues
• Region Observer
• preGetOp
• postGetOp
• prePut
• postPut
• WAL Observer
• preWALRoll
• preWALWrite
• postWALRoll
• postWALWrite
Observer Coprocessor – Possible Triggers
• Master Observer
• Runs in HBase master
• Create, Delete, Modify table
• Clone, Restore, Delete Snapshots
• Region splits and many more
Coprocessor Uses
Why use a coprocessor?
• Control behavior of Region Server
• Control behavior of data operations
• Server side processing
• Filters
• Aggregates
• Reduces pressure on the client side and network
• Less data sent from the server
• NOT for complex data analysis
• Apache Phoenix (“We put the SQL back in NoSQL”)
Coprocessor Example
Post-Get example (Region Server, postGet)
Table Representation:
Key Col1 Col2 Col3 Col4 Col5
Key1Abc 1 4 5
Key1Def 2 2 2
Key1Xyz 10 11 12
Coprocessor Result:
Key1 Abc-col1 Def-col2 Abc-col3 Abc-col4 Xyz-col5
Key1 1 2 4 5 12
Apache Phoenix Coprocessors
• Apache Phoenix – OLTP and operational analytics for Apache Hadoop
• Phoenix jar runs on top of the Region Server
• Phoenix coprocessors map data model to raw bits and vice versa
• Coprocessor can ensure foreign key constraints
• Write to additional tables for global mutable secondary indices
• Server-side push down of many calculations – like filter, hash join
Development Abuses and Solutions
Application Developer Perspective
Challenge – Exceptions
• Exceptions (other than IOExceptions) in the coprocessor bring down the Region Server
• 100% loss of service
• Manual intervention needed to bring the cluster back to health
• In other cases, the coprocessor silently unloads – depends on global settings (hbase.coprocessor.abortonerror)
Solution – Catch all exceptions
public final void prePut(...)
    throws IOException {
  try {
    prePutImpl(…);
  }
  catch (IOException ex) {
    // Allow IOExceptions to propagate
    // They won't cause an unload
    throw ex;
  }
  catch (Throwable ex) {
    // Wrap other exceptions as IOException
    LOG.error("prePut: caught ", ex);
    throw new IOException(ex);
  }
}
Even better – create interface code for all coprocessors
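One way to share this pattern across all hooks is a small helper that every coprocessor method delegates to. This is a minimal sketch, not HBase code: CoprocessorGuard and HookBody are hypothetical names, and java.util.logging stands in for whatever logger the coprocessor already uses.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper that applies the catch-all wrapping to any hook body. */
final class CoprocessorGuard {
    private static final Logger LOG = Logger.getLogger(CoprocessorGuard.class.getName());

    /** A hook body that may throw anything. */
    interface HookBody {
        void run() throws Exception;
    }

    static void run(String hookName, HookBody body) throws IOException {
        try {
            body.run();
        } catch (IOException ex) {
            // IOExceptions propagate unchanged, as in the prePut example.
            throw ex;
        } catch (Throwable t) {
            // Everything else is logged and wrapped as an IOException.
            LOG.log(Level.SEVERE, hookName + ": caught", t);
            throw new IOException(hookName + " failed", t);
        }
    }
}
```

A hook would then read, for example, CoprocessorGuard.run("prePut", () -> prePutImpl(...)), so the wrapping lives in one place instead of being copied into every method.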
Problem – Memory hog
• Memory is shared between the Region Server and the coprocessor
• Memory hogging slows down Region Server
Solutions – Defensive Java code
• Profile all coprocessor code for memory usage
• Use a generic profiler with a driver for your coprocessor
• e.g., JProfiler
• Use common Java tricks for limiting memory usage
• Use primitive types and underlying arrays where possible
• Use immutable objects
• StringBuilder vs. String concatenation
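Two of these tips can be illustrated in a few lines. This is a minimal sketch with hypothetical names (RowKeyUtil, joinQualifiers, sum); it shows the general Java techniques only and makes no claim about how much memory a given coprocessor would save.

```java
/** Hypothetical helpers showing StringBuilder use and primitive arrays. */
final class RowKeyUtil {
    /** Builds "prefix-q1-q2-..." with one StringBuilder instead of repeated String concatenation. */
    static String joinQualifiers(String prefix, String[] qualifiers) {
        StringBuilder sb = new StringBuilder(prefix);
        for (String qualifier : qualifiers) {
            sb.append('-').append(qualifier);
        }
        return sb.toString();
    }

    /** Sums values held in a primitive long[]; no boxed Long objects are allocated. */
    static long sum(long[] values) {
        long total = 0;
        for (long value : values) {
            total += value;
        }
        return total;
    }
}
```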
Logging and metrics tips
• Update log4j.properties file with a separate log parameter for coprocessors
• Use MDC (Mapped Diagnostic Context) to pass parameters to all parts of the coprocessor – across pre and post operations, for example
• http://www.slf4j.org/api/org/slf4j/MDC.html
• Create an extra column in a Result to pass back an object populated with metrics
Deployment Abuses and Solutions
Administrator Perspective
How to deploy a coprocessor?
• Known as coprocessor loading
• Static loading
• Dynamic loading
Static Loading
hbase-site.xml
hbase.coprocessor.region.classes
hbase.coprocessor.wal.classes
hbase.coprocessor.master.classes
Example:
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>com.bloomberg.hbase.coprocessor.endpoint.AverageEndPoint</value>
</property>
Property          Type of Coprocessor
region.classes    Region Observer, Endpoints
wal.classes       WAL Observer
master.classes    Master Observer
Static Loading
hbase-site.xml
RS1 RS2 RS3 RS4 RS5 RS6
Static Loading – Key points to remember
• Active on all Region Servers
— All the regions of all the tables
• Multiple coprocessors can be deployed
— Provide a “,” separated list
• Update HBase’s class path
— In hbase-env.sh
— Copy jar to hbase/lib folder
• Must restart every Region Server
— hbase.dynamic.jars.dir is for filters
Static Unloading
hbase-site.xml
hbase.coprocessor.region.classes
hbase.coprocessor.wal.classes
hbase.coprocessor.master.classes
RS1 RS2 RS3 RS4 RS5 RS6
Dynamic Loading
• Modification to hbase-site.xml is not required
• Loading is on a per-table basis
— Only available to the table they are loaded for
• Known as Table coprocessors
• Loaded via HBase shell
— Admin API via Java
• JAR is stored in a shared location
— Usually HDFS
• Preferred method at Bloomberg
Dynamic Loading
HDFS
/hbase/coprocessors
coprocessor.jar MyRSObserver.jar
Dynamic Loading - HBase shell – Admin/Developers
Image source: http://hbase.apache.org/book.htm
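The screenshot itself is not reproduced in this transcript. As a rough illustration, the HBase reference guide describes the table attribute as a pipe-separated value of jar path, class name, priority, and arguments, set with an alter command in the shell. The helper below is a minimal sketch that only assembles those strings: CoprocessorSpec and its methods are hypothetical, the jar path and class name in the usage note are placeholders, and the exact shell syntax should be checked against the HBase version in use.

```java
/** Hypothetical helper that assembles the coprocessor table attribute and shell command text. */
final class CoprocessorSpec {
    // Same numeric value as the priority shown in the reference guide example (1073741823).
    static final int PRIORITY_USER = Integer.MAX_VALUE / 2;

    /** Returns "jarPath|className|priority|args" (args may be empty). */
    static String build(String jarPath, String className, int priority, String args) {
        if (className == null || className.isEmpty()) {
            throw new IllegalArgumentException("className is required");
        }
        return jarPath + "|" + className + "|" + priority + "|" + args;
    }

    /** Returns the text of a shell command that sets the attribute (assumed syntax, verify before use). */
    static String alterCommand(String table, String spec) {
        return "alter '" + table + "', METHOD => 'table_att', 'coprocessor' => '" + spec + "'";
    }
}
```

For example, build("hdfs:///hbase/coprocessors/coprocessor.jar", "com.example.MyObserver", PRIORITY_USER, "") yields a spec that can be passed to alterCommand("users", spec); both the class name and the table name here are placeholders.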
Dynamic Loading: Java API – Developers
Source: http://hbase.apache.org/book.htm
Dynamic Unloading: HBase shell – Admin/Developers
Source: http://hbase.apache.org/book.htm
Dynamic Unloading: Java API - Developers
Note: In HBase 0.96 and newer, you can instead use the removeCoprocessor() method of the HTableDescriptor class.
Source: http://hbase.apache.org/book.htm
Comparison
Description                  Static Loading      Dynamic Loading
Changes to hbase-site.xml    Yes                 No
Restart region servers       Yes                 No
Jar location                 Local filesystem    HDFS
Coprocessor availability     Global              Per table
Loaded via hbase shell       No                  Yes
Loaded via java API          No                  Yes
Read permissions             HBase               HBase
Management complexity        High                Low
Challenges
• Version compatibility can cause failures
• Rollout of non-backward-compatible coprocessor difficult
• Clean-up can be messy (HBASE-14190 – Assign system tables ahead of user region assignment)
Solutions
• Compile code against version deployed on the cluster
• Keep your developers informed
• Work with developers
• Review their code
• Test version changes in development environment
Challenges
• User deployment is dangerous
• HDFS Permissions Changes
• Crashes entire cluster
• If hbase.coprocessor.abortonerror is set to true
• Bringing up cluster is challenging
• Requires manual intervention
• Missing jar will bring down the Region Server
ERROR org.apache.hadoop.hbase.coprocessor.CoprocessorHost: The coprocessor fooCoprocessor threw java.io.FileNotFoundException: File does not exist: /path/to/corprocessor.jar
java.io.FileNotFoundException: File does not exist: /path/to/corprocessor.jar
Solutions
• Create a common shared directory under /hbase (/hbase/coprocessors)
• Use automation to deploy user coprocessor to shared location
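An automated deployment could refuse to register a coprocessor whose jar is missing from the shared directory. This is a minimal sketch of such a pre-check, assuming the local filesystem as a stand-in for HDFS; JarDeployCheck and isDeployable are hypothetical names, and a real check would go through the Hadoop FileSystem API instead.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-deployment check: the jar must exist inside the shared directory. */
final class JarDeployCheck {
    static boolean isDeployable(Path sharedDir, Path jar) {
        Path dir = sharedDir.toAbsolutePath().normalize();
        Path candidate = jar.toAbsolutePath().normalize();
        return candidate.getFileName() != null
                && candidate.startsWith(dir)                               // inside the shared location
                && candidate.getFileName().toString().endsWith(".jar")    // looks like a jar
                && Files.isRegularFile(candidate)                          // actually present
                && Files.isReadable(candidate);                            // readable by this process
    }
}
```

Running a check like this before altering the table avoids the FileNotFoundException shown above reaching the Region Server.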
Deployment Guidelines
• Enable user coprocessors
• Set hbase.coprocessor.user.enabled to true
• Phoenix coprocessors are treated as user coprocessors
• Set hbase.coprocessor.enabled to true
• Keeps system coprocessors enabled (AccessController)
• Enable coprocessor white listing (HBASE-16700)
• Set hbase.coprocessor.region.whitelist.paths
• Specify each directory individually
• Wildcards won’t include subdirectories (documentation says otherwise)
• Entire filesystem, e.g., hdfs://Test-Laptop, won’t work (documentation says otherwise)
Deployment guidelines
• Create a bundled jar
• Includes all coprocessors
• Single entry in HBase classpath
• Use automation
• Chef/Puppet/Ansible
hbase.coprocessor.enabled
hbase.coprocessor.user.enabled
hbase.coprocessor.regionserver.classes
hbase.coprocessor.region.classes
hbase.coprocessor.user.region.classes
hbase.coprocessor.master.classes
hbase.coprocessor.wal.classes
hbase.coprocessor.abortonerror
hbase.coprocessor.region.whitelist.paths
Takeaways
Recap
• Coprocessors are necessary
• Phoenix
• HBase security
• User coprocessors are dangerous
• Write defensive code
• Be careful with deployment
• Make use of HBASE-16700
• Cleanup can be messy
• HBASE-14190 – Assign system tables ahead of user region assignment
Needed from the Community
• Story for coprocessor deployment
• Process isolation
• JMX metrics
Thank You!
Reference: http://hbase.apache.org
Chef Code: https://github.com/bloomberg/chef-bach.git
Connect with Hadoop Team: hadoop@bloomberg.net
We are hiring!
Questions?
https://www.bloomberg.com/careers

HBase coprocessors, Uses, Abuses, Solutions

  • 1. © 2018 Bloomberg Finance L.P. All rights reserved. Coprocessors: Uses, Abuses & Solutions DataWorks Summit June 20, 2018 Esther Kundin & Amit Anand Senior Software Developers
  • 2. © 2018 Bloomberg Finance L.P. All rights reserved. About the Speakers • Esther Kundin • Senior Software Developer • Lead architect and engineer • Machine Learning and Text Analysis • Open Source contributor • Amit Anand • Senior Software Developer • Hadoop Services / Infrastructure team • Deployments/Tooling • Open Source contributor
  • 3. © 2018 Bloomberg Finance L.P. All rights reserved. Outline • HBase Architecture Review • Introduction to Coprocessors • Coprocessor Uses • Development Abuses and Solutions • Deployment Abuses and Solutions • Takeaways • Q&A
  • 4. © 2018 Bloomberg Finance L.P. All rights reserved. HBase Architecture Review
  • 5. © 2018 Bloomberg Finance L.P. All rights reserved. HBase Concepts • HBase is the Hadoop database, a distributed, scalable, big data store • Keys (RowKeys) • Regions • Region Servers Region 1 Keys 1, 2 Region 2 Keys 3, 4 Region 3 Keys 5,6 Region 4 Keys 7, 8 Region Server 1 Region Server 2
  • 6. © 2018 Bloomberg Finance L.P. All rights reserved. HBase processes • HMaster – table DDL operations • Region Servers – data read/write operations • Region Server with .META. table – maintains region with map of regions to Region Server • Outside Systems: — Zookeeper – highly reliable distributed coordination
  • 7. © 2018 Bloomberg Finance L.P. All rights reserved. HBase Read Overview Client HDFS Region Server Region Server Region Server Region Server serving .META. region Client requests data from the Region Server, which gets data from HDFS Client determines which Region Server to contact and caches that data In-Memory Miss Zookeeper Client contacts ZK for address of Region Server serving .META. region
  • 8. © 2018 Bloomberg Finance L.P. All rights reserved. HBase Write Overview Region Server WAL (on HDFS) MemStore HFile HFile HFile Region Server persists write at the end of the WAL Regions Server saves write in a sorted map in memory in the MemStore When MemStore reaches a configurable size, it is flushed to an HFile
  • 9. © 2018 Bloomberg Finance L.P. All rights reserved. Introduction to Coprocessors
  • 10. © 2018 Bloomberg Finance L.P. All rights reserved. What is a coprocessor? • Custom code written by users, run as part of the HBase system • Bundled Jar • Loaded in HBase daemon JVM • HMaster • Region Server Region Server – HBase code Coprocessor Jar – application code
  • 11. © 2018 Bloomberg Finance L.P. All rights reserved. Endpoint Coprocessor • Similar to Stored Procedure – triggered by client API call • Computation at data location • Average • Summation • Use Protobuf to specify the input/output structure of your service • Starting with HBase • Must be invoked • Table.CoprocessorService() • Implementation of service in Java • Run the coprocessor by calling that service to do the computation on the collocated data Region Server – HBase code Coprocessor Jar – application code Client
  • 12. © 2018 Bloomberg Finance L.P. All rights reserved. Observer Coprocessor • Similar to trigger – triggered action happening in the daemon • Hooks available for all HBase event types • Canonical example - Can implement a security check on preGetOp/prePut • Referential integrity – constraint on synthetic foreign keys • Secondary indices – that’s how Phoenix does it Region Server – HBase code Coprocessor Jar – application code Client HDFS Data Flow
  • 13. © 2018 Bloomberg Finance L.P. All rights reserved. Filters • Similar to Observer Coprocessors for get/scan • Can filter on any part of a row • Many built-in filters available • Can deploy custom filters • Support for push-down predicates • Different API • Came earlier and much more limited
  • 14. © 2018 Bloomberg Finance L.P. All rights reserved. Filter Example Snippet courtesy of https://github.com/larsgeorge/hbase-book/blob/master/ch04/src/main/java/filters/CustomFilterExample.java
  • 15. © 2018 Bloomberg Finance L.P. All rights reserved. Observer Coprocessor – Possible Triggers • Region Server Observer • preStopRegionServer • preExecuteProcedures • postExecuteProcedures • preClearCompactionQueues • postClearCompactionQueues • Region Observer • preGetOp • postGetOp • prePut • postPut WAL Observer • preWALRoll • preWALWrite • postWALRoll • postWALWrite
  • 16. © 2018 Bloomberg Finance L.P. All rights reserved. Observer Coprocessor – Possible Triggers • Master Observer • Runs in HBase master • Create, Delete, Modify table • Clone, Restore, Delete Snapshots • Region splits and many more
  • 17. © 2018 Bloomberg Finance L.P. All rights reserved. Coprocessor Uses
  • 18. © 2018 Bloomberg Finance L.P. All rights reserved. Why use a coprocessor? • Control behavior of Region Server • Control behavior of data operations • Server side processing • Filters • Aggregates • Reduces pressure on the client side and network • Less amount of data being sent from server • NOT for complex data analysis • Apache Phoenix (“We put the SQL back in NoSQL”)
  • 19. © 2018 Bloomberg Finance L.P. All rights reserved. Coprocessor Example Post-Get example Region Server postGet Key Col1 Col2 Col3 Col4 Col5 Key1Abc 1 4 5 Key1Def 2 2 2 Key1Xyz 10 11 12 Key1 Abc- col1 Def- col2 Abc- col3 Abc- col4 Xyz- col5 Key1 1 2 4 5 12 Table Representation: Coprocessor Result:
  • 20. © 2018 Bloomberg Finance L.P. All rights reserved. Apache Phoenix Coprocessors • Apache Phoenix – OLTP and operational analytics for Apache Hadoop • Phoenix jar runs on top of the Region Server • Phoenix coprocessors map data model to raw bits and vice versa • Coprocessor can ensure foreign key constraints • Write to additional tables for global mutable secondary indices • Server-side push down of many calculations – like filter, hash join
  • 21. © 2018 Bloomberg Finance L.P. All rights reserved. Development Abuses and Solutions Application Developer Perspective
  • 22. © 2018 Bloomberg Finance L.P. All rights reserved. Challenge – Exceptions • Exceptions (other than IOExceptions) in the coprocessor bring down Region Server • 100% loss of service • Manual intervention needed bring cluster back to health • In other cases, the coprocessor silently unloads – depends on global settings
  • 23. © 2018 Bloomberg Finance L.P. All rights reserved. Solution – Catch all exception public final void prePut(...) throws IOException { try { prePutImpl(…); } catch(IOException ex) { // Allow IOExceptions to propagate // They won't cause an unload throw ex; } catch(Throwable ex) { // Wrap other exceptions as IOException LOG.error("prePut: caught ", ex); throw new IOException(ex); } } Even better – create interface code for all coprocessors
  • 24. © 2018 Bloomberg Finance L.P. All rights reserved. Problem – Memory hog • Memory is shared with Region Server memory and coprocessor memory • Memory hogging slows down Region Server
  • 25. © 2018 Bloomberg Finance L.P. All rights reserved. Solutions – Defensive Java code • Profile all coprocessor code for memory usage • Use a generic profiler with a driver for your coprocessor • i.e., JProfiler • Use common Java tricks for limiting memory usage • Use primitive types and underlying arrays where possible • Use immutable objects • StringBuilder vs. String concatenation
  • 26. Logging and metrics tips • Update log4j.properties file with a separate log parameter for coprocessors • Use MDC (Mapped Diagnostic Context) context to pass parameters to all parts of the coprocessor – across pre and post operations, for example • http://www.slf4j.org/api/org/slf4j/MDC.html • Create an extra column in a Result to pass back an object populated with metrics
  • 27. © 2018 Bloomberg Finance L.P. All rights reserved. Deployment Abuses and Solutions Administrator Perspective
  • 28. © 2018 Bloomberg Finance L.P. All rights reserved. How to deploy a coprocessor? • Known as coprocessor loading • Static loading • Dynamic loading
  • 29. © 2018 Bloomberg Finance L.P. All rights reserved. Static Loading hbase-site.xml hbase.coprocessor.region.classes hbase.coprocessor.wal.classes hbase.coprocessor.master.classes Example: <property> <name>hbase.coprocessor.region.classes</name> <value>com.bloomberg.hbase.coprocessor.endpoint.AverageEndPoint</value> </property> Property Type of Coprocessor region.classes Region Observer End Points wal.classes Wal Observer master.classes Master Observer
  • 30. © 2018 Bloomberg Finance L.P. All rights reserved. Static Loading hbase-site.xml RS1 RS2 RS3 RS4 RS5 RS6
  • 31. © 2018 Bloomberg Finance L.P. All rights reserved. Static Loading – Key points to remember • Active on all Region Servers — All the regions of all the tables • Multiple co-processors can be deployed — Provide a “,” separated list • Update HBase’s class path — In hbase-env.sh — Copy jar to hbase/lib folder • Must restart every Region Server — hbase.dynamic.jars.dir is for filters
  • 32. Static Unloading (diagram: hbase-site.xml properties hbase.coprocessor.region.classes, hbase.coprocessor.wal.classes and hbase.coprocessor.master.classes; Region Servers RS1 to RS6)
  • 33. Dynamic Loading
      • No modification to hbase-site.xml is required
      • Loading is on a per-table basis
          • Coprocessors are only available to the table they are loaded for
      • Known as table coprocessors
      • Loaded via the HBase shell
          • Or via the Admin API in Java
      • The JAR is stored in a shared location
          • Usually HDFS
      • Preferred method at Bloomberg
  • 34. Dynamic Loading (diagram: HDFS directory /hbase/coprocessors containing coprocessor.jar and MyRSObvserver.jar)
  • 35. Dynamic Loading: HBase shell – Admin/Developers (image source: http://hbase.apache.org/book.htm)
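The shell screenshot for this slide is not part of the transcript. As a hedged sketch based on the table-attribute format described in the HBase reference guide (jar path, class name, priority and optional arguments separated by `|`), the Java below assembles such an attribute value and a matching shell `alter` command. The helper class `CoprocessorSpec`, the table name `users`, the `hdfs:///` prefix and the arguments are hypothetical; check the exact syntax against the reference guide for the HBase version deployed on your cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical builder for the table-coprocessor attribute string. */
final class CoprocessorSpec {

    /** Integer.MAX_VALUE / 2 = 1073741823, the priority used in the reference guide's example. */
    static final int USER_PRIORITY = Integer.MAX_VALUE / 2;

    private CoprocessorSpec() {}

    /** Builds "jarPath|className|priority" plus "|k=v,k=v" when arguments are present. */
    static String tableAttribute(String jarPath, String className, int priority,
                                 Map<String, String> args) {
        StringBuilder spec = new StringBuilder()
                .append(jarPath).append('|')
                .append(className).append('|')
                .append(priority);
        if (!args.isEmpty()) {
            StringJoiner joined = new StringJoiner(",");
            for (Map.Entry<String, String> e : args.entrySet()) {
                joined.add(e.getKey() + "=" + e.getValue());
            }
            spec.append('|').append(joined.toString());
        }
        return spec.toString();
    }

    /** Wraps the spec in a shell 'alter' command that sets it as a table attribute. */
    static String shellAlter(String table, String spec) {
        return "alter '" + table + "', METHOD => 'table_att', 'Coprocessor' => '" + spec + "'";
    }

    public static void main(String[] args) {
        Map<String, String> cpArgs = new LinkedHashMap<>(); // insertion order is preserved
        cpArgs.put("arg1", "1");
        cpArgs.put("arg2", "2");
        String spec = tableAttribute(
                "hdfs:///hbase/coprocessors/coprocessor.jar", // assumed shared HDFS location
                "com.bloomberg.hbase.coprocessor.endpoint.AverageEndPoint",
                USER_PRIORITY,
                cpArgs);
        System.out.println(shellAlter("users", spec));
    }
}
```

Generating the command from one place keeps the jar path and class name consistent between what automation copies to HDFS and what the table descriptor references.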
  • 36. Dynamic Loading: Java API – Developers (source: http://hbase.apache.org/book.htm)
  • 37. Dynamic Unloading: HBase shell – Admin/Developers (source: http://hbase.apache.org/book.htm)
  • 38. Dynamic Unloading: Java API – Developers. Note: In HBase 0.96 and newer, you can instead use the removeCoprocessor() method of the HTableDescriptor class. (source: http://hbase.apache.org/book.htm)
  • 39. Comparison
      Description                  Static Loading      Dynamic Loading
      Changes to hbase-site.xml    Yes                 No
      Restart Region Servers       Yes                 No
      Jar location                 Local filesystem    HDFS
      Coprocessor availability     Global              Per table
      Loaded via HBase shell       No                  Yes
      Loaded via Java API          No                  Yes
      Read permissions             HBase               HBase
      Management complexity        High                Low
  • 40. Challenges
      • Version compatibility can cause failures
      • Rollout of a non-backward-compatible coprocessor is difficult
      • Clean-up can be messy
          • HBASE-14190 - Assign system tables ahead of user region assignment
      Solutions:
      • Compile code against the version deployed on the cluster
      • Keep your developers informed
      • Work with developers
          • Review their code
          • Test version changes in a development environment
  • 41. Challenges
      • User deployment is dangerous
          • HDFS permission changes
      • Crashes the entire cluster
          • If hbase.coprocessor.abortonerror is set to true
      • Bringing the cluster back up is challenging
          • Requires manual intervention
      • A missing jar will bring down the Region Server
          ERROR org.apache.hadoop.hbase.coprocessor.CoprocessorHost: The coprocessor fooCoprocessor threw java.io.FileNotFoundException: File does not exist: /path/to/corprocessor.jar
          java.io.FileNotFoundException: File does not exist: /path/to/corprocessor.jar
      Solutions:
      • Create a common shared directory under /hbase (/hbase/coprocessors)
      • Use automation to deploy user coprocessors to the shared location
  • 42. Deployment Guidelines
      • Enable user coprocessors
          • Set hbase.coprocessor.user.enabled to true
          • Phoenix coprocessors are treated as user coprocessors
      • Set hbase.coprocessor.enabled to true
          • Keeps system coprocessors enabled (AccessController)
      • Enable coprocessor whitelisting (HBASE-16700)
          • Set hbase.coprocessor.region.whitelist.paths
          • Specify each directory individually
          • Wildcards won't include subdirectories (documentation says otherwise)
          • Entire filesystem, e.g., hdfs://Test-Laptop, won't work (documentation says otherwise)
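Given the behaviour reported on this slide (each directory listed individually, subdirectories not picked up), a deployment script might run a conservative pre-check before referencing a jar from a table. The sketch below is such a check under those assumptions: it accepts a jar only when its parent directory exactly matches a listed directory. The class `WhitelistPreCheck` is hypothetical and does not reproduce HBase's own matching logic.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical deployment-side check; deliberately stricter than wildcard matching. */
final class WhitelistPreCheck {

    private WhitelistPreCheck() {}

    /** Returns the directory part of a slash-separated jar path. */
    static String parentDir(String jarPath) {
        int slash = jarPath.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("path has no directory part: " + jarPath);
        }
        return slash == 0 ? "/" : jarPath.substring(0, slash);
    }

    /** Drops one trailing slash so "/hbase/coprocessors/" and "/hbase/coprocessors" compare equal. */
    static String stripTrailingSlash(String dir) {
        return dir.length() > 1 && dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
    }

    /** True only when the jar sits directly inside one of the listed directories. */
    static boolean isAllowed(String jarPath, List<String> whitelistedDirs) {
        String parent = parentDir(jarPath);
        for (String dir : whitelistedDirs) {
            if (stripTrailingSlash(dir).equals(parent)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("/hbase/coprocessors");
        System.out.println(isAllowed("/hbase/coprocessors/coprocessor.jar", dirs));
        System.out.println(isAllowed("/hbase/coprocessors/team-a/coprocessor.jar", dirs));
    }
}
```

Being stricter than the cluster setting means the check may reject a path HBase would accept, which is the safer direction for an automated rollout.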
  • 43. Deployment Guidelines
      • Create a bundled jar
          • Includes all coprocessors
          • Single entry in the HBase classpath
      • Use automation
          • Chef/Puppet/Ansible
      • Coprocessor-related properties:
          • hbase.coprocessor.enabled
          • hbase.coprocessor.user.enabled
          • hbase.coprocessor.regionserver.classes
          • hbase.coprocessor.region.classes
          • hbase.coprocessor.user.region.classes
          • hbase.coprocessor.master.classes
          • hbase.coprocessor.wal.classes
          • hbase.coprocessor.abortonerror
          • hbase.coprocessor.region.whitelist.paths
  • 44. Takeaways
  • 45. Recap
      • Coprocessors are necessary
          • Phoenix
          • HBase security
      • User coprocessors are dangerous
          • Write defensive code
          • Be careful with deployment
          • Make use of HBASE-16700
      • Cleanup can be messy
          • HBASE-14190 – Assign system tables ahead of user region assignment
  • 46. Needed from the Community
      • Story for coprocessor deployment
      • Process isolation
      • JMX metrics
  • 47. Thank You! Reference: http://hbase.apache.org Chef Code: https://github.com/bloomberg/chef-bach.git Connect with Hadoop Team: hadoop@bloomberg.net
  • 48. We are hiring! Questions? https://www.bloomberg.com/careers

Editor's Notes

  • #13–#16: Compared to a stored procedure. Has a service-like interface. Use Protobuf to specify the input/output structure of your service. Implement the service in Java. Run the coprocessor by calling that service to do the computation on the co-located data.
  • #23: Image from https://pixabay.com/en/exclamation-point-matter-requests-507770/
  • #24: Image from https://pixabay.com/en/hog-pig-animal-barnyard-farm-152402/
  • #25: Image from https://pixabay.com/en/hog-pig-animal-barnyard-farm-152402/