Email notifications from HBase data
Hadoop : HBase
Coprocessors and
Oozie jobs
Jinith Joseph
• Introduction to HBase coprocessors and Oozie Java jobs
• Notifications from HBase using HBase coprocessors & Oozie jobs
• Implementation on Cloudera CDH 5.4.0 without Flume
Requirement
Scope of Coprocessors
• System Level : Coprocessors can be configured to run on every table and region
• Table Level : Coprocessors can be configured to run on all regions of a specific table
Types of Coprocessors
• Observer coprocessor ( Acts like triggers in conventional databases )
Allows users to insert custom code by overriding the upcall methods provided by the coprocessor framework. The callback
functions are executed from core HBase code when certain events occur.
• Endpoint coprocessor ( Resembles stored procedures in conventional databases )
Users can invoke the endpoint at any time from the client; it is executed remotely at the target region or all regions, and the
results are returned to the client.
Overview of HBase Coprocessor
Observer Coprocessor
Three types of Observers
• Region Observer
These observers provide hooks into data manipulation events such as get, put, delete and scan on HBase tables.
There is one RegionObserver coprocessor instance per table region, but the scope of the
observers can be set to a specific region or across all regions.
• WAL Observer
Provides hooks for write-ahead log (WAL) related operations. These run in the context of WAL
processing and are used for WAL writing and reconstruction events ( e.g. bulk load of HBase tables using HFiles ).
• Master Observer
These observers provide hooks into DDL operations such as create / alter / delete table. The master observer
runs within the context of the HBase master.
Oozie
Apache Oozie is a system for running workflows of dependent jobs, and contains two main
engines :
• Workflow engine – stores and runs workflows composed of different types of Hadoop jobs
• Coordinator engine – runs workflow jobs based on predefined schedules and data availability. It allows the
user to define and execute recurrent and interdependent workflow jobs
The Oozie web UI is available at the URL : http://<namenode>:11000
Using an Oozie workflow, different tasks can be set up as part of the workflow, including :
• Hive Script, Pig, Spark, Java, Sqoop, MapReduce, Shell, Ssh, HDFS Fs, Email, Streaming, Distcp, etc.
Oozie workflows generally follow a transition diagram to report their status :
Start → Actions → End ( Success ) | Fail ( Error )
How to Connect these?
HBase data table : data is inserted into HBase.
Region Observer coprocessor ( Java ) : postPut() is invoked after the data has been inserted into the HBase
table.
HBase log table / Oozie job : the coprocessor writes a copy to a log table, then invokes an Oozie job
with parameters on a different thread.
Oozie Java job, using the 3rd party jars mail.jar and activation.jar : frames the email content
from the parameters and, using SMTP, sends the email out to the users.
* Might not be an ideal way to run in a production environment. Please consider the rate of data and the number of threads invoked to run Oozie jobs.
Region Observer Coprocessor – postPut()
• The client code is written in Java; it overrides the upcall methods of the coprocessor framework to
implement an Observer coprocessor, which is initiated as soon as data arrives in the HBase
table.
• A put does not complete until the coprocessor hooks have finished, so any heavy code in the
coprocessor ( overriding put events ) adds directly to the cost of inserting data into the HBase tables. So here we
place our functionality ( sending out emails ) in an Oozie job that runs on a separate thread
and is triggered from the HBase observer coprocessor.
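The offloading pattern described above can be sketched in plain Java. This is a minimal sketch, not the coprocessor itself: `sendNotification` is a hypothetical stand-in for the Oozie submission, and a bounded thread pool is used rather than one thread per row.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncHookSketch {
    // A bounded pool keeps a burst of puts from spawning one thread per row.
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicInteger sent = new AtomicInteger();

    // Hypothetical stand-in for the slow work (here: the Oozie job submission).
    static void sendNotification(String rowKey) {
        sent.incrementAndGet();
    }

    // Called from the write path: hand the slow work to the pool and return immediately,
    // so the put itself is not delayed by the notification.
    static void onPut(String rowKey) {
        pool.submit(() -> sendNotification(rowKey));
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10; i++) onPut("row-" + i);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("notifications=" + sent.get());
    }
}
```

A fixed pool also bounds how many Oozie submissions can be in flight at once, which matters given the production warning above.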
Oozie workflow – Java job using 3rd Party jars
• The Oozie workflow contains a Java job that sends out emails using SMTP; it is invoked using the Oozie Java API
• The workflow has a dedicated folder in HDFS with the structure and files below ( each file is described later.. )
- /user/oozie/OozieWFConfigs/emailAppDef
- job.properties ( file holding the specific properties of the workflow )
- workflow.xml ( the workflow definition )
- /lib
- <Our Jar File>.jar, activation.jar , mail.jar
How we use HBase coprocessor & Oozie
Target : Write a simple Java program, with a main class, that sends out emails to users using SMTP.
Steps :
1. Create a simple Java project with the package org.
2. Create a class EmailJava.java as below :
EmailJava.jar Implementation
package org;

// Requires mail.jar ( javax.mail ) and activation.jar on the classpath.
import java.util.Date;
import java.util.Properties;
import javax.activation.DataHandler;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeBodyPart;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;
import javax.mail.util.ByteArrayDataSource;
import com.sun.mail.smtp.SMTPTransport;
import sun.misc.BASE64Decoder; // JDK-internal API

public class Emails {
private static byte[] Data;
public static void main(String [] args) {
try {
String barcode = args[0]; String pdfContent = args[1];
BASE64Decoder decoder = new BASE64Decoder();
Data = decoder.decodeBuffer(pdfContent);
Properties props = System.getProperties();
props.put("mail.smtp.host","mail.company.com");
props.put("mail.smtp.auth","true");
Session session = Session.getInstance(props, null);
Message msg = new MimeMessage(session);
msg.setFrom(new InternetAddress("oozie@company.com"));
msg.setRecipients(Message.RecipientType.TO,InternetAddress.parse("<user email address>", false));
msg.setSubject("Your Purchase Receipt "+ barcode );
msg.setHeader("X-Mailer", "Company CRM Communications");
String content = "Thank you for shopping. Please find the attached receipt for your purchase.";
MimeBodyPart textBodyPart = new MimeBodyPart();
textBodyPart.setText(content);
3. Compile the code with mail.jar and activation.jar included in the referenced libraries to create
EmailJava.jar
EmailJava.jar Implementation
MimeBodyPart pdfBodyPart = new MimeBodyPart();
pdfBodyPart.setDataHandler(new DataHandler(new ByteArrayDataSource(Data, "application/pdf")));
pdfBodyPart.setFileName("Receipt_" + barcode + ".pdf");
MimeMultipart mimeMultipart = new MimeMultipart();
mimeMultipart.addBodyPart(textBodyPart);
mimeMultipart.addBodyPart(pdfBodyPart);
msg.setContent(mimeMultipart);
msg.setSentDate(new Date());
SMTPTransport t = (SMTPTransport)session.getTransport("smtp");
System.out.println("Attempting to connect to the SMTP Server");
t.connect("mail.company.com", "<SMTP_UserName>", "<SMTP_Password>");
System.out.println("Connection succeeded. Attempting to send email.");
t.sendMessage(msg, msg.getAllRecipients());
System.out.println("Response: " + t.getLastServerResponse());
t.close();
}
catch(Exception ex)
{
ex.printStackTrace();
}
}
}
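A side note on the Base64 handling used in the listing: sun.misc.BASE64Decoder/BASE64Encoder are JDK-internal classes that were removed in Java 9. On Java 8 and later, the portable replacement is java.util.Base64. A minimal round-trip sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Sketch {
    public static void main(String[] args) {
        byte[] pdfBytes = "fake pdf content".getBytes(StandardCharsets.UTF_8);
        // Encode the attachment bytes so they can be passed as a workflow argument...
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);
        // ...and decode them back on the receiving side, as Emails.main() does.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```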
Target : As the Oozie workflow has to be triggered from a coprocessor, we will first
implement an Oozie workflow that sends out emails to users. We define the workflow
in XML and will test it within the Cloudera environment.
Steps :
1. In HDFS file browser create a new folder “emailAppDef” ( say in the path :
“/user/oozie/OozieWFConfigs/emailAppDef” )
2. Within the folder create a file “workflow.xml”. Edit the file to have contents like below :
Oozie workflow- Implementation
<workflow-app name="Email" xmlns="uri:oozie:workflow:0.5">
<start to="java-95a1"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="java-95a1">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>org.Emails</main-class>
<arg>${barcode}</arg>
<arg>${pdfContent}</arg>
<file>/user/oozie/OozieWFConfigs/emailAppDef/lib/mail.jar#mail.jar</file>
<file>/user/oozie/OozieWFConfigs/emailAppDef/lib/activation.jar#activation.jar</file>
<file>/user/oozie/OozieWFConfigs/emailAppDef/lib/EmailJava.jar#EmailJava.jar</file>
</java>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
Steps :
3. In the XML in the last step, we specified that the main class is “org.Emails”, and
included 3 files – mail.jar, activation.jar and EmailJava.jar – where EmailJava.jar is the program we
wrote, containing the main class “org.Emails”
4. Create another file named “job.properties” within the same directory “emailAppDef”. This file
holds the properties of the workflow. The contents of the job.properties file are :
nameNode=hdfs://<namenode IP Address>:8020
jobTracker= <namenode IP Address>:8021
queueName=default
oozie.wf.application.path=/user/jinith.joseph/OozieWFConfigs/emailAppDef
oozie.use.system.libpath=true
5. Create a new folder “lib” within “emailAppDef”, which will hold the main class jar file and the dependent
3rd party jars.
6. Add EmailJava.jar, mail.jar and activation.jar to the “lib” folder.
7. We have now finished defining a workflow that invokes a Java job from Oozie.
Oozie workflow- Implementation
Target : Create a coprocessor that records a history of all insertions into an HBase table and
triggers an Oozie workflow from Java.
Steps : Create a simple Java project referencing the Hadoop, logging, HBase, Common and
Oozie Client jars. Create a Java class as below, overriding the start() and
postPut() functions, which are executed when the coprocessor is loaded for a region and
after a record is added to the HBase table, respectively.
HBase Coprocessor - Implementation
package com; // matches the class name com.incCoProc used later when attaching the coprocessor

// Requires the HBase, Hadoop, commons-logging and Oozie client jars on the classpath.
import java.io.IOException;
import java.util.Properties;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.oozie.client.OozieClient;
import sun.misc.BASE64Encoder; // JDK-internal API

public class incCoProc extends BaseRegionObserver {
private byte[] master;
private byte[] family;
private byte[] flags = Bytes.toBytes("flags");
private Log log = LogFactory.getLog(incCoProc.class);
@Override
public void start(CoprocessorEnvironment e) throws IOException {
Configuration conf = e.getConfiguration();
master = Bytes.toBytes(conf.get("master"));
family = Bytes.toBytes(conf.get("family"));
}
The start function is executed when
the coprocessor is associated with an HBase
table ( or with all of them ). You can provide
parameters to this function.
Here, the start function accepts a couple of
arguments on initialization. I have used two
arguments, picked from
the configuration, which denote a table
name and a family name for further
operations.
HBase Coprocessor - Implementation
@Override
public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) {
try
{
final RegionCoprocessorEnvironment env = e.getEnvironment();
final byte[] row = put.getRow();
Get get = new Get(row);
final Result result = env.getRegion().get(get);
new Thread(new Runnable() {
public void run() {
try {
insert(env, row, result);
} catch (IOException e1) {
e1.printStackTrace();
}
}
}).start();
}
catch(IOException ex)
{ log.error("postPut failed", ex); } // logged and swallowed: an exception escaping a coprocessor hook can abort the region server
}
postPut(), which must be overridden, is
called after any insertion into the
associated HBase tables.
Here we obtain the environment details
and the row details of the new record in the
HBase table, and call a function on a
separate thread to minimize the insertion time
on the parent HBase table.
* postPut() is executed after every insert, and
because the hook runs synchronously, the
insertion time on the HBase table grows linearly
with the time required to complete the actions inside it.
HBase Coprocessor - Implementation
private void insert(RegionCoprocessorEnvironment env, byte[] row, Result r) throws IOException {
Table masterTable = env.getTable(TableName.valueOf(master));
try {
CellScanner scanner = r.cellScanner(); StringBuilder str = new StringBuilder();
int count = 0;
while (scanner.advance()) {
Cell cell = scanner.current();
byte[] qualifier = CellUtil.cloneQualifier(cell);
byte[] value = CellUtil.cloneValue(cell);
if (count > 0)
str.append("|");
str.append(Bytes.toString(qualifier)).append("=").append(Bytes.toString(value));
count++;
}
Put put = new Put(row); put.addColumn(family, row, Bytes.toBytes(str.toString()));
put.addColumn(flags,Bytes.toBytes("EmailSend"),Bytes.toBytes("false"));
put.addColumn(flags, Bytes.toBytes("Archived"), Bytes.toBytes("false"));
put.addColumn(flags, Bytes.toBytes("Flattened"), Bytes.toBytes("false"));
masterTable.put(put);
BASE64Encoder encoder = new BASE64Encoder();
final String pdfContent = encoder.encodeBuffer(Bytes.toBytes(str.toString()));
OozieJobInvoke(pdfContent);
}
catch(Exception ex)
{ log.error("insert failed", ex); } // logged and swallowed so a failure here cannot abort the region server
finally {
masterTable.close();
}
}
In this insert function, we read the
data of the newly inserted row from the
parent HBase table and insert the same data,
with some extra fields ( flags ), into a log
table ( master ).
For every Put on table A ( with a row key
"<row id>" ), a cell is put into the master
table with row key "<row id>", qualifier
"<row id>" and, as its value, a concatenation of
the qualifiers and values of the row in A.
The master table name and column family are
passed as arguments.
Finally, we initiate an Oozie job call with
the content of the data ( pdfContent ).
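The value-building step above ( concatenating qualifier=value pairs with "|" and Base64-encoding the result ) can be sketched in isolation. This is an illustrative sketch with made-up cell values; java.util.Base64 stands in for the JDK-internal BASE64Encoder used in the listing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

public class RowFlattenSketch {
    // Joins qualifier=value pairs with "|", mirroring the loop in insert().
    static String flatten(Map<String, String> cells) {
        StringBuilder str = new StringBuilder();
        int count = 0;
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            if (count > 0) str.append("|");
            str.append(cell.getKey()).append("=").append(cell.getValue());
            count++;
        }
        return str.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("name", "Alice");   // hypothetical qualifiers and values
        cells.put("total", "42.50");
        String flat = flatten(cells);
        System.out.println(flat);
        // Encode for transport as the pdfContent workflow argument.
        String encoded = Base64.getEncoder()
                .encodeToString(flat.getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded);
    }
}
```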
HBase Coprocessor - Implementation
private void OozieJobInvoke(String pdfContent) {
OozieClient wc = new OozieClient("http://hdfs:hdfs@<name node>:11000/oozie");
Properties conf = wc.createConfiguration();
conf.setProperty("nameNode", "hdfs://<name node>:8020");
conf.setProperty("jobTracker", "<name node>:8032");
conf.setProperty("queueName", "default");
conf.setProperty("oozie.libpath", "${nameNode}/user/oozie/OozieWFConfigs/emailAppDef/lib");
conf.setProperty("oozie.use.system.libpath", "true");
conf.setProperty("oozie.wf.rerun.failnodes", "true");
conf.setProperty("oozieProjectRoot",
"${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
conf.setProperty("appPath",
"${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
conf.setProperty(OozieClient.APP_PATH, "${appPath}/workflow.xml");
conf.setProperty("barcode", "0MN20151228102N21452");
conf.setProperty("pdfContent", pdfContent);
try {
String jobId = wc.run(conf);
} catch (Exception r) {
r.printStackTrace(); // surface submission failures instead of silently swallowing them
}
}
In this function, where the Oozie job is called,
we specify various configuration properties
for the Oozie job to pick up, then set
the project path and the application path
( which is the HDFS path of the Oozie
workflow XML we created and
uploaded ).
Once OozieClient.run() is called, the
Oozie job is submitted. The
oozie.libpath property denotes the jar files that
have to be included in the Oozie workflow.
The Oozie workflow has a couple of arguments
that can be set using the conf.setProperty()
function ( barcode, pdfContent ).
As we have already defined the Oozie
workflow to execute a Java job, the Java job
is called and the email is sent.
HBase Coprocessor – Assign to an HBase
table
Target : Associate the HBase coprocessor with an HBase table
Steps :
1. Upload the coprocessor jar file to an HDFS location.
2. Start the hbase shell and create the required tables.
3. Disable the table with which the coprocessor is being associated.
4. Use the command below to associate the HBase coprocessor with the HBase table :
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att', 'coprocessor' =>
'hdfs:///user/oozie/JARS/incCoProc.jar|com.incCoProc||master=log,family=family'
While the coprocessor is being associated with the HBase table, you will see the regions being updated in the shell.
5. Enable the table to which the coprocessor was attached.
6. If you run describe '<HBase table name>' from the hbase shell, you should see that the coprocessor is attached to the HBase table.
7. Try inserting new records into the HBase table to which the coprocessor was attached; you should see that a history is maintained
in the “log” table and emails are sent out to the configured user.
master and family are
arguments passed to
the HBase coprocessor,
where the overridden start()
function uses them for further
operations.
This way, every insertion into
the data table is also
inserted into the “log” table
under the column family “family”.
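The coprocessor attribute value in step 4 follows the pattern path|class|priority|key=value,... . A small sketch that assembles it, just to make the four fields explicit ( the jar path and class name are the ones from the alter command above; the priority field is left empty, as there ):

```java
public class CoprocSpecSketch {
    // Builds the table_att 'coprocessor' attribute value: path | class | priority | args.
    static String coprocessorSpec(String jarPath, String className,
                                  String priority, String args) {
        return jarPath + "|" + className + "|" + priority + "|" + args;
    }

    public static void main(String[] args) {
        System.out.println(coprocessorSpec(
                "hdfs:///user/oozie/JARS/incCoProc.jar",
                "com.incCoProc",
                "",                          // empty priority, as in the shell command
                "master=log,family=family")); // args read by start() via the configuration
    }
}
```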
HBase Coprocessor – Unset from an HBase
table
Target : Unset the HBase coprocessor from an HBase table
Steps :
1. Start the hbase shell.
2. Disable the table to which the coprocessor is attached.
3. Use the command below to unset the HBase coprocessor from the HBase table :
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
4. Enable the table.
5. Try inserting new records into the table; no actions will be triggered.
Thank You!
I hope this gave you an introduction to HBase coprocessors and Oozie jobs and how they can be
combined for various functionalities.

More Related Content

PPTX
Sherlock Homepage - A detective story about running large web services - WebN...
PPTX
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
PPTX
DNS for Developers - NDC Oslo 2016
PDF
Working with Hive Analytics
PPTX
Everything you wanted to know, but were afraid to ask about Oozie
PDF
High Performance Hibernate JavaZone 2016
PPTX
Hadoop Oozie
PPTX
July 2012 HUG: Overview of Oozie Qualification Process
Sherlock Homepage - A detective story about running large web services - WebN...
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
DNS for Developers - NDC Oslo 2016
Working with Hive Analytics
Everything you wanted to know, but were afraid to ask about Oozie
High Performance Hibernate JavaZone 2016
Hadoop Oozie
July 2012 HUG: Overview of Oozie Qualification Process

What's hot (20)

PDF
Consuming RESTful services in PHP
PDF
Oozie HUG May12
PPTX
Clogeny Hadoop ecosystem - an overview
ODP
An Overview of Node.js
PDF
Apache spark with akka couchbase code by bhawani
PPTX
BD-zero lecture.pptx
PDF
Enable Database Service over HTTP or IBM WebSphere MQ in 15_minutes with IAS
PDF
High-Performance Hibernate - JDK.io 2018
PPTX
Ex-8-hive.pptx
PPTX
Apache Oozie
PDF
October 2013 HUG: Oozie 4.x
PDF
Event Processing and Integration with IAS Data Processors
PDF
Mule caching strategy with redis cache
PDF
AD102 - Break out of the Box
PDF
Solr Application Development Tutorial
PPTX
Break out of The Box - Part 2
PPTX
Oozie or Easy: Managing Hadoop Workloads the EASY Way
PPTX
IBM Connect 2016 - Break out of the Box
PDF
COScheduler
PDF
Wizard of ORDS
Consuming RESTful services in PHP
Oozie HUG May12
Clogeny Hadoop ecosystem - an overview
An Overview of Node.js
Apache spark with akka couchbase code by bhawani
BD-zero lecture.pptx
Enable Database Service over HTTP or IBM WebSphere MQ in 15_minutes with IAS
High-Performance Hibernate - JDK.io 2018
Ex-8-hive.pptx
Apache Oozie
October 2013 HUG: Oozie 4.x
Event Processing and Integration with IAS Data Processors
Mule caching strategy with redis cache
AD102 - Break out of the Box
Solr Application Development Tutorial
Break out of The Box - Part 2
Oozie or Easy: Managing Hadoop Workloads the EASY Way
IBM Connect 2016 - Break out of the Box
COScheduler
Wizard of ORDS
Ad

Similar to Hbase coprocessor with Oozie WF referencing 3rd Party jars (20)

PPTX
Apache Oozie
DOCX
Node js getting started
PPT
nodejs_at_a_glance.ppt
PPTX
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
PPT
nodejs_at_a_glance, understanding java script
PPTX
PPTX
Unit 5-apache hive
PPTX
Splice Machine Overview
PPTX
Unit 5
PPTX
Academy PRO: HTML5 Data storage
PPTX
מיכאל
PPTX
Debugging Hive with Hadoop-in-the-Cloud
PPTX
De-Bugging Hive with Hadoop-in-the-Cloud
PPTX
My Saminar On Php
PDF
Yahoo! Hack Europe Workshop
PPTX
Etl with talend (big data)
PDF
Basic API Creation with Node.JS
PPTX
Apache Oozie
Node js getting started
nodejs_at_a_glance.ppt
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
nodejs_at_a_glance, understanding java script
Unit 5-apache hive
Splice Machine Overview
Unit 5
Academy PRO: HTML5 Data storage
מיכאל
Debugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
My Saminar On Php
Yahoo! Hack Europe Workshop
Etl with talend (big data)
Basic API Creation with Node.JS
Ad

Recently uploaded (20)

PDF
WOOl fibre morphology and structure.pdf for textiles
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PPTX
Tartificialntelligence_presentation.pptx
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
Enhancing emotion recognition model for a student engagement use case through...
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PPT
What is a Computer? Input Devices /output devices
PDF
Zenith AI: Advanced Artificial Intelligence
PDF
Taming the Chaos: How to Turn Unstructured Data into Decisions
PPTX
observCloud-Native Containerability and monitoring.pptx
PDF
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
PDF
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
PDF
A Late Bloomer's Guide to GenAI: Ethics, Bias, and Effective Prompting - Boha...
PPTX
O2C Customer Invoices to Receipt V15A.pptx
PDF
Assigned Numbers - 2025 - Bluetooth® Document
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PDF
STKI Israel Market Study 2025 version august
PPTX
Chapter 5: Probability Theory and Statistics
PDF
Unlock new opportunities with location data.pdf
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
WOOl fibre morphology and structure.pdf for textiles
sustainability-14-14877-v2.pddhzftheheeeee
Tartificialntelligence_presentation.pptx
1 - Historical Antecedents, Social Consideration.pdf
Enhancing emotion recognition model for a student engagement use case through...
NewMind AI Weekly Chronicles – August ’25 Week III
What is a Computer? Input Devices /output devices
Zenith AI: Advanced Artificial Intelligence
Taming the Chaos: How to Turn Unstructured Data into Decisions
observCloud-Native Containerability and monitoring.pptx
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
A Late Bloomer's Guide to GenAI: Ethics, Bias, and Effective Prompting - Boha...
O2C Customer Invoices to Receipt V15A.pptx
Assigned Numbers - 2025 - Bluetooth® Document
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
STKI Israel Market Study 2025 version august
Chapter 5: Probability Theory and Statistics
Unlock new opportunities with location data.pdf
Final SEM Unit 1 for mit wpu at pune .pptx

Hbase coprocessor with Oozie WF referencing 3rd Party jars

  • 1. Email notifications from HBase data Hadoop : HBase Coprocessors and Oozie jobs Jinith Joseph
  • 2. • Introduction to Hbase co processors and Oozie java jobs • Notifications from HBase using HBase co processors & Oozie jobs • Implementation on Cloudera cdh 5.4.0 without flume Requirement
  • 3. Scope of Coprocessors • System Level : Coprocessors can be configured to work on every tables and regions • Table Level : Coprocessors can be configured to work on all regions of a specific table Types of Coprocessors • Observer coprocessor ( Acts like triggers in conventional databases ) This allows users to insert custom code by overriding the upcall methods provided by the coprocessor framework. The callback functions are executed from core HBase code when certain events occur. • Endpoint coprocessor ( Resembled stored procedures in conventional databases ) User can invoke the end point at any time from the client, which will be executed remotely at the target or all regions and the results will be returned to the client. Overview of HBase Coprocessor
  • 4. Observer Coprocessor Three types of Observers • Region Observer These observers provide hooks to data manipulation events like get , put, delete, scan , etc on HBase tables. For every table region, there will be an instance of RegionObserver coprocessor. But the scope of the observers could be set to a specific region or across all regions. • WAL Observer Provides hooks for write-ahead log (WAL) related operations. These runs in the context of WAL processing. This is used for WAL writing and reconstruction events (Eg: Bulk load of HBase tables using HFile). • Master Observer These observes provides hook on the DDL operations like create / alter / delete tables. The master observer runs within the context of HBase master.
  • 5. Oozie Apache Oozie is a system for running workflows of dependent jobs, and contains two main engines : • workflow engine – This stores and runs workflows composed of different types of Hadoop jobs • coordinator engine - This runs workflow jobs based on predefined schedules and data availability. It allows the user to define and execute recurrent and interdependent workflow jobs The Oozie web UI will be available in the URL : http://namdenode:11000 Using Oozie workflow different tasks can be setup up as a part of workflow including : • Hive Script, Pig , Spark, Java, Sqoop, MapReduce, Shell, Ssh, HDFS Fs, Email, Streaming, Distcp, etc. Oozie workflows generally follows a transition Diagram to report its status as : Start Actions End Fail Error Success
  • 6. How to Connect these? HBase Data TableData insertion to HBase Coprocessor Region Observer Coprocessor postput() which will be invoked after the insertion of the data in HBase table has happened HBase Log Table Oozie job Invoke Oozie job with parameters on a different thread Using 3rd Party .jars – mail.jar and activation.jar Frame the email content from parameters Using SMTP send out the email to users. * Might not be an ideal way to run in production environment. Please consider the rate of data and the threads invoked to run Oozie jobs JAVA JAVA
  • 7. Region Observer Coprocessor – postput() • The client code will be written in Java which would override the upcall methods of coprocessor framework and implement Observer coprocessor, which will be initiated as soon as there is a data available in the HBase table. • The insertion of the data will consume the same time before any coprocessors are finished. So any bulk code on the Coprocessor (overriding put events), will cost us on inserting data in the HBase tables. So here we would use have our functionality ( Sending out emails ) on an oozie job which will run on a separate thread and will be triggered from the HBase observer coprocessor Oozie workflow – Java job using 3rd Party jars • The Oozie workflow will contain a java job to send out emails using SMTP, will be invoked using Oozie Java API • Workflow will have a dedicated folder in HDFS with below structure and files ( Each files described later.. ) - /user/oozie/OozieWFConfigs/emailAppDef - job.properties ( file to hold the specific properties of the workflow ) - workflow.xml ( the workflow design ) - /lib - <Our Jar File>.jar, activation.jar , mail.jar How we use HBase coprocessor & Oozie
  • 8. Target : To write a simple java program with main class, to send out emails to users using SMTP. Steps : 1. Create a Simple Java project, with package org. 2. Create a class EmailJava.java as below : EmailJava.jar Implementation package org; public class Emails { private static byte[] Data; public static void main(String [] args) { try { String barcode = args[0]; String pdfContent = args[1]; BASE64Decoder decoder = new BASE64Decoder(); Data = decoder.decodeBuffer(pdfContent); Properties props = System.getProperties(); props.put("mail.smtp.host","mail.company.com"); props.put("mail.smtp.auth","true"); Session session = Session.getInstance(props, null); Message msg = new MimeMessage(session); msg.setFrom(new InternetAddress("oozie@company.com")); msg.setRecipients(Message.RecipientType.TO,InternetAddress.parse("<user email address>", false)); msg.setSubject("Your Purchase Receipt "+ barcode ); msg.setHeader("X-Mailer", "Company CRM Communications"); String content = "Thank you for shopping. Please find the attached receipt for your purchase."; MimeBodyPart textBodyPart = new MimeBodyPart(); textBodyPart.setText(content);
  • 9. 3. Compile the code with mail.jar and activation.jar included in the referenced libraries to create EmailJava.jar EmailJava.jar Implementation MimeBodyPart pdfBodyPart = new MimeBodyPart(); pdfBodyPart.setDataHandler(new DataHandler(new ByteArrayDataSource(Data, "application/pdf"))); pdfBodyPart.setFileName("Receipt_" + barcode + ".pdf"); MimeMultipart mimeMultipart = new MimeMultipart(); mimeMultipart.addBodyPart(textBodyPart); mimeMultipart.addBodyPart(pdfBodyPart); msg.setContent(mimeMultipart); msg.setSentDate(new Date()); SMTPTransport t = (SMTPTransport)session.getTransport("smtp"); System.out.println("Attempting to connect to the SMTP Server"); t.connect("mail.company.com", "<SMTP_UserName>", "<SMTP_Password>"); System.out.println("Connection succeeded. Attempting to send email."); t.sendMessage(msg, msg.getAllRecipients()); System.out.println("Response: " + t.getLastServerResponse()); t.close(); } catch(Exception ex) { ex.printStackTrace(); } } }
  • 10. Target : As the oozie workflow have to be triggered from a coprocessor, we will first implement an oozie workflow to send out emails to users. We define the workflow using XML and will test it within Cloudera environment. Steps : 1. In HDFS file browser create a new folder “emailAppDef” ( say in the path : “/user/oozie/OozieWFConfigs/emailAppDef” ) 2. Within the folder create a file “workflow.xml”. Edit the file to have contents like below : Oozie workflow- Implementation <workflow-app name="Email" xmlns="uri:oozie:workflow:0.5"> <start to="java-95a1"/> <kill name="Kill"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <action name="java-95a1"> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <main-class>org.Emails</main-class> <arg>${barcode}</arg> <arg>${pdfContent}</arg> <file>/user/oozie/OozieWFConfigs/emailAppDef/lib/mail.jar#mail.jar</file> <file>/user/oozie/OozieWFConfigs/emailAppDef/lib/activation.jar#activation.jar</file> <file>/user/oozie/OozieWFConfigs/emailAppDef/lib/EmailJava.jar#EmailJava.jar</file> </java> <ok to="End"/> <error to="Kill"/> </action> <end name="End"/> </workflow-app>
  • 11. Steps : 3. With the XML in the last step, we have mentioned that main class is “org.Emails”, and then we have included 3 files – mail.jar, activation.jar and EmailJava.jar , where EmailJava.jar is the program we have written and contains the main class “org.Emails” 4. Create another file within the same directory “emailAppDef” with name “job.properties”. This file will hold the properties of the workflow. The contents of the job.properties file will be as : nameNode=hdfs://<namenode IP Address>:8020 jobTracker= <namenode IP Address>:8021 queueName=default oozie.wf.application.path=/user/jinith.joseph/OozieWFConfigs/emailAppDef oozie.use.system.libpath=true 5. Create a new folder “lib” within “emailAppDef” , which will hold the main class jar file and the dependent 3rd party jars. 6. Add the EmaiJava.jar, mail.jar and activation.jar into the “lib” folder. 7. We have now finished defining the workflow to invoke a java job from oozie. Oozie workflow- Implementation
  • 12. Target : To create a coprocessor to create a history of all insertions in a HBase table and trigger an oozie workflow in Java. Steps : Create a simple java project referencing Hadoop, logging, HBase, Common and Oozie Client jars. Create a java class as below which will override the start() and postPut() function, which will be executed when there is any activity in the region and post adding a record in the HBase table respectively. HBase Coprocessor - Implementation public class incCoProc extends BaseRegionObserver { private byte[] master; private byte[] family; private byte[] flags = Bytes.toBytes("flags"); private Log log; @Override public void start(CoprocessorEnvironment e) throws IOException { Configuration conf = e.getConfiguration(); master = Bytes.toBytes(conf.get("master")); family = Bytes.toBytes(conf.get("family")); } The start function will be executed when the coprocessor is associated to a HBase table or all of them. You could provide parameters to this function. Here, the start function accepts couple of arguments to be initiated. I have used two arguments here , which are picked from the configuration, which denotes a table name and a family name for operations further.
  • 13. HBase Coprocessor - Implementation @Override public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) { try { final RegionCoprocessorEnvironment env = e.getEnvironment(); final byte[] row = put.getRow(); Get get = new Get(row); final Result result = env.getRegion().get(get); new Thread(new Runnable() { public void run() { try { insert(env, row, result); } catch (IOException e1) { e1.printStackTrace(); } } }).start(); } catch(IOException ex) { } } postPut() which should be overridden, will be called after there is any insertion on the associated Hbase tables. Here we have got the environment details, and the row details of the new record in the HBase table. And called a function on a separate thread to minimize insertion time to the parent HBase table. * After each insert the postPut() will be executed and the time of insertion to the HBase table is linearly related to the time required to complete the actions.
• 14. HBase Coprocessor - Implementation

private void insert(RegionCoprocessorEnvironment env, byte[] row, Result r) throws IOException {
    Table masterTable = env.getTable(TableName.valueOf(master));
    try {
        CellScanner scanner = r.cellScanner();
        StringBuilder str = new StringBuilder();
        int count = 0;
        while (scanner.advance()) {
            Cell cell = scanner.current();
            byte[] qualifier = CellUtil.cloneQualifier(cell);
            byte[] value = CellUtil.cloneValue(cell);
            if (count > 0) str.append("|");
            str.append(Bytes.toString(qualifier)).append("=").append(Bytes.toString(value));
            count++;
        }
        Put put = new Put(row);
        put.addColumn(family, row, Bytes.toBytes(str.toString()));
        put.addColumn(flags, Bytes.toBytes("EmailSend"), Bytes.toBytes("false"));
        put.addColumn(flags, Bytes.toBytes("Archived"), Bytes.toBytes("false"));
        put.addColumn(flags, Bytes.toBytes("Flattened"), Bytes.toBytes("false"));
        masterTable.put(put);
        BASE64Encoder encoder = new BASE64Encoder();
        final String pdfContent = encoder.encodeBuffer(Bytes.toBytes(str.toString()));
        OozieJobInvoke(pdfContent);
    } catch (Exception ex) {
        // swallow
    } finally {
        masterTable.close();
    }
}

In this insert function we read the data of the row newly inserted into the parent HBase table and insert the same data, plus some extra fields (flags), into a log table (master). For every Put on table A (with row key "<rowid>"), a cell is put into the master table with the same row key, qualifier "<rowid>", and as value a concatenation of the qualifiers and values of the row in A. The master table name and column family are passed as arguments. Finally, an Oozie job is invoked with the Base64-encoded content of the data (pdfContent).
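The CellScanner loop above reduces to joining qualifier=value pairs with "|". A minimal sketch of that flattening with plain Java collections (RowFlattener and the sample qualifiers are illustrative; no HBase dependency):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowFlattener {
    // Same logic as the CellScanner loop: concatenate qualifier=value
    // pairs, separated by '|', preserving the order of the cells.
    public static String flatten(Map<String, String> cells) {
        StringBuilder str = new StringBuilder();
        int count = 0;
        for (Map.Entry<String, String> e : cells.entrySet()) {
            if (count > 0) str.append("|");
            str.append(e.getKey()).append("=").append(e.getValue());
            count++;
        }
        return str.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("name", "jinith");
        cells.put("city", "kochi");
        System.out.println(flatten(cells));  // name=jinith|city=kochi
    }
}
```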
• 15. HBase Coprocessor - Implementation

private void OozieJobInvoke(String pdfContent) {
    OozieClient wc = new OozieClient("http://hdfs:hdfs@<name node>:11000/oozie");
    Properties conf = wc.createConfiguration();
    conf.setProperty("nameNode", "hdfs://<name node>:8020");
    conf.setProperty("jobTracker", "<name node>:8032");
    conf.setProperty("queueName", "default");
    conf.setProperty("oozie.libpath", "${nameNode}/user/oozie/OozieWFConfigs/emailAppDef/lib");
    conf.setProperty("oozie.use.system.libpath", "true");
    conf.setProperty("oozie.wf.rerun.failnodes", "true");
    conf.setProperty("oozieProjectRoot", "${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
    conf.setProperty("appPath", "${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
    conf.setProperty(OozieClient.APP_PATH, "${appPath}/workflow.xml");
    conf.setProperty("barcode", "0MN20151228102N21452");
    conf.setProperty("pdfContent", pdfContent);
    try {
        String jobId = wc.run(conf);
    } catch (Exception r) {
        // swallow
    }
}

In this function, where the Oozie job is invoked, we set the various configuration properties the Oozie job will pick up, then the project path and the application path (the HDFS path of the Oozie workflow XML we created and uploaded). Once OozieClient.run() is called, the Oozie job is submitted. The oozie.libpath property points to the jar files that have to be included in the Oozie workflow. The workflow takes a couple of arguments, which are set via conf.setProperty() (barcode, pdfContent). Since we have already defined the Oozie workflow to execute a Java action, that Java job runs and the email is sent.
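Note that the BASE64Encoder used on slide 14 is the internal sun.misc class, which is not part of the public JDK API and was removed in Java 9. A portable equivalent using java.util.Base64 (public API since Java 8); the class name PdfContentEncoder is illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PdfContentEncoder {
    // Encode the flattened row content for transport as an Oozie
    // workflow property, replacing sun.misc.BASE64Encoder.
    public static String encode(String content) {
        return Base64.getEncoder()
                     .encodeToString(content.getBytes(StandardCharsets.UTF_8));
    }

    // The workflow's Java action would decode it back on the other side.
    public static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("name=jinith|city=kochi");
        System.out.println(encoded);
        System.out.println(decode(encoded));  // name=jinith|city=kochi
    }
}
```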
• 16. HBase Coprocessor – Assign to an HBase table
Target : Associate the HBase coprocessor with an HBase table
Steps :
1. Upload the coprocessor jar file to an HDFS location.
2. Start hbase shell and create the required tables.
3. Disable the table to which the coprocessor is being attached.
4. Use the command below to associate the HBase coprocessor with the HBase table:
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att', 'coprocessor' => 'hdfs:///user/oozie/JARS/incCoProc.jar|com.incCoProc||master=log,family=family'
While the coprocessor is being associated with the HBase table, you will see the regions being updated on the shell.
5. Enable the table to which the coprocessor was attached.
6. If you run describe '<HBase table name>' from hbase shell, you should see that the coprocessor is attached to the HBase table.
7. Try inserting new records into the HBase table to which the coprocessor was attached; you should see a history maintained in the "log" table and emails sent out to the configured user.
master and family are arguments passed to the HBase coprocessor; the overridden start() function uses them for the later operations, so that every insertion on the observed table is recorded in the "log" table under the column family "family".
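The steps above can be sketched as a single hbase shell session; the table names 'mytable', 'log' and the column families are illustrative stand-ins for your own:

```
hbase(main):001:0> create 'mytable', 'cf'
hbase(main):002:0> create 'log', 'family', 'flags'
hbase(main):003:0> disable 'mytable'
hbase(main):004:0> alter 'mytable', METHOD => 'table_att', 'coprocessor' => 'hdfs:///user/oozie/JARS/incCoProc.jar|com.incCoProc||master=log,family=family'
hbase(main):005:0> enable 'mytable'
hbase(main):006:0> describe 'mytable'
```

The attach string has four pipe-separated fields: the jar path in HDFS, the fully qualified coprocessor class, an optional priority (empty here, so the default is used), and the key=value arguments read by start().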
• 17. HBase Coprocessor – Unset from an HBase table
Target : Unset the HBase coprocessor from an HBase table
Steps :
1. Start hbase shell.
2. Disable the table to which the coprocessor is attached.
3. Use the command below to remove the HBase coprocessor from the HBase table:
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
4. Enable the table.
5. Try inserting new records into the table; no actions will be triggered anymore.
• 18. Thank You! I hope this has given you an introduction to HBase coprocessors and Oozie jobs, and to how they can be combined for various functionalities.