Email notifications from HBase data
Hadoop : HBase
Coprocessors and
Oozie jobs
Jinith Joseph
• Introduction to HBase coprocessors and Oozie Java jobs
• Notifications from HBase using HBase coprocessors & Oozie jobs
• Implementation on Cloudera CDH 5.4.0 without Flume
Requirement
Scope of Coprocessors
• System Level : Coprocessors can be configured to run on every table and region
• Table Level : Coprocessors can be configured to run on all regions of a specific table
Types of Coprocessors
• Observer coprocessor ( Acts like triggers in conventional databases )
Allows users to insert custom code by overriding the upcall methods provided by the coprocessor framework. The callback
functions are executed from core HBase code when certain events occur.
• Endpoint coprocessor ( Resembles stored procedures in conventional databases )
Users can invoke the endpoint at any time from the client; it is executed remotely at the target region or all regions, and the
results are returned to the client.
Overview of HBase Coprocessor
Observer Coprocessor
Three types of Observers
• Region Observer
These observers provide hooks into data manipulation events such as get, put, delete and scan on HBase tables.
There is one RegionObserver coprocessor instance per table region, but the scope of the
observers can be set to a specific region or across all regions.
• WAL Observer
Provides hooks for write-ahead log (WAL) related operations. These run in the context of WAL
processing and are used for WAL writing and reconstruction events ( e.g. bulk load of HBase tables using HFiles ).
• Master Observer
These observers provide hooks into DDL operations such as create / alter / delete table. The master observer
runs within the context of the HBase master.
Oozie
Apache Oozie is a system for running workflows of dependent jobs, and contains two main
engines :
• Workflow engine – stores and runs workflows composed of different types of Hadoop jobs
• Coordinator engine – runs workflow jobs based on predefined schedules and data availability. It allows the
user to define and execute recurrent and interdependent workflow jobs
The Oozie web UI is available at the URL : http://<namenode>:11000
Using an Oozie workflow, different tasks can be set up as part of the workflow, including :
• Hive Script, Pig, Spark, Java, Sqoop, MapReduce, Shell, Ssh, HDFS Fs, Email, Streaming, Distcp, etc.
Oozie workflows generally follow a transition diagram to report their status :
Start → Actions → End ( Success ) | Fail ( Error )
How to Connect these?
HBase data table : data is inserted into HBase.
Region Observer coprocessor ( Java ) : postPut() is invoked after the data has been inserted into the HBase
table.
HBase log table / Oozie job : the coprocessor writes a copy to a log table, then invokes an Oozie job
with parameters on a different thread.
Oozie Java job, using the 3rd party jars mail.jar and activation.jar : frames the email content
from the parameters and, using SMTP, sends the email out to the users.
* Might not be an ideal way to run in a production environment. Please consider the rate of data and the number of threads invoked to run Oozie jobs.
Region Observer Coprocessor – postPut()
• The client code is written in Java; it overrides the upcall methods of the coprocessor framework to
implement an Observer coprocessor, which is initiated as soon as data arrives in the HBase
table.
• A put does not complete until the coprocessor hooks have finished, so any heavy code in the
coprocessor ( overriding put events ) adds directly to the cost of inserting data into the HBase tables. So here we
place our functionality ( sending out emails ) in an Oozie job that runs on a separate thread
and is triggered from the HBase observer coprocessor.
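The offloading pattern described above can be sketched in plain Java. This is a minimal sketch, not the coprocessor itself: `sendNotification` is a hypothetical stand-in for the Oozie submission, and a bounded thread pool is used rather than one thread per row.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncHookSketch {
    // A bounded pool keeps a burst of puts from spawning one thread per row.
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicInteger sent = new AtomicInteger();

    // Hypothetical stand-in for the slow work (here: the Oozie job submission).
    static void sendNotification(String rowKey) {
        sent.incrementAndGet();
    }

    // Called from the write path: hand the slow work to the pool and return immediately,
    // so the put itself is not delayed by the notification.
    static void onPut(String rowKey) {
        pool.submit(() -> sendNotification(rowKey));
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10; i++) onPut("row-" + i);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("notifications=" + sent.get());
    }
}
```

A fixed pool also bounds how many Oozie submissions can be in flight at once, which matters given the production warning above.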
Oozie workflow – Java job using 3rd Party jars
• The Oozie workflow contains a Java job that sends out emails using SMTP; it is invoked using the Oozie Java API
• The workflow has a dedicated folder in HDFS with the structure and files below ( each file is described later.. )
- /user/oozie/OozieWFConfigs/emailAppDef
- job.properties ( file holding the specific properties of the workflow )
- workflow.xml ( the workflow definition )
- /lib
- <Our Jar File>.jar, activation.jar , mail.jar
How we use HBase coprocessor & Oozie
Target : Write a simple Java program, with a main class, that sends out emails to users using SMTP.
Steps :
1. Create a simple Java project with the package org.
2. Create a class EmailJava.java as below :
EmailJava.jar Implementation
package org;

// Requires mail.jar ( javax.mail ) and activation.jar on the classpath.
import java.util.Date;
import java.util.Properties;
import javax.activation.DataHandler;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeBodyPart;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;
import javax.mail.util.ByteArrayDataSource;
import com.sun.mail.smtp.SMTPTransport;
import sun.misc.BASE64Decoder; // JDK-internal API

public class Emails {
private static byte[] Data;
public static void main(String [] args) {
try {
String barcode = args[0]; String pdfContent = args[1];
BASE64Decoder decoder = new BASE64Decoder();
Data = decoder.decodeBuffer(pdfContent);
Properties props = System.getProperties();
props.put("mail.smtp.host","mail.company.com");
props.put("mail.smtp.auth","true");
Session session = Session.getInstance(props, null);
Message msg = new MimeMessage(session);
msg.setFrom(new InternetAddress("oozie@company.com"));
msg.setRecipients(Message.RecipientType.TO,InternetAddress.parse("<user email address>", false));
msg.setSubject("Your Purchase Receipt "+ barcode );
msg.setHeader("X-Mailer", "Company CRM Communications");
String content = "Thank you for shopping. Please find the attached receipt for your purchase.";
MimeBodyPart textBodyPart = new MimeBodyPart();
textBodyPart.setText(content);
3. Compile the code with mail.jar and activation.jar included in the referenced libraries to create
EmailJava.jar
EmailJava.jar Implementation
MimeBodyPart pdfBodyPart = new MimeBodyPart();
pdfBodyPart.setDataHandler(new DataHandler(new ByteArrayDataSource(Data, "application/pdf")));
pdfBodyPart.setFileName("Receipt_" + barcode + ".pdf");
MimeMultipart mimeMultipart = new MimeMultipart();
mimeMultipart.addBodyPart(textBodyPart);
mimeMultipart.addBodyPart(pdfBodyPart);
msg.setContent(mimeMultipart);
msg.setSentDate(new Date());
SMTPTransport t = (SMTPTransport)session.getTransport("smtp");
System.out.println("Attempting to connect to the SMTP Server");
t.connect("mail.company.com", "<SMTP_UserName>", "<SMTP_Password>");
System.out.println("Connection succeeded. Attempting to send email.");
t.sendMessage(msg, msg.getAllRecipients());
System.out.println("Response: " + t.getLastServerResponse());
t.close();
}
catch(Exception ex)
{
ex.printStackTrace();
}
}
}
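A side note on the Base64 handling used in the listing: sun.misc.BASE64Decoder/BASE64Encoder are JDK-internal classes that were removed in Java 9. On Java 8 and later, the portable replacement is java.util.Base64. A minimal round-trip sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Sketch {
    public static void main(String[] args) {
        byte[] pdfBytes = "fake pdf content".getBytes(StandardCharsets.UTF_8);
        // Encode the attachment bytes so they can be passed as a workflow argument...
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);
        // ...and decode them back on the receiving side, as Emails.main() does.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```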
Target : As the Oozie workflow has to be triggered from a coprocessor, we will first
implement an Oozie workflow that sends out emails to users. We define the workflow
in XML and will test it within the Cloudera environment.
Steps :
1. In HDFS file browser create a new folder “emailAppDef” ( say in the path :
“/user/oozie/OozieWFConfigs/emailAppDef” )
2. Within the folder create a file “workflow.xml”. Edit the file to have contents like below :
Oozie workflow- Implementation
<workflow-app name="Email" xmlns="uri:oozie:workflow:0.5">
<start to="java-95a1"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="java-95a1">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>org.Emails</main-class>
<arg>${barcode}</arg>
<arg>${pdfContent}</arg>
<file>/user/oozie/OozieWFConfigs/emailAppDef/lib/mail.jar#mail.jar</file>
<file>/user/oozie/OozieWFConfigs/emailAppDef/lib/activation.jar#activation.jar</file>
<file>/user/oozie/OozieWFConfigs/emailAppDef/lib/EmailJava.jar#EmailJava.jar</file>
</java>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
Steps :
3. In the XML in the last step, we specified that the main class is “org.Emails”, and
included 3 files – mail.jar, activation.jar and EmailJava.jar – where EmailJava.jar is the program we
wrote, containing the main class “org.Emails”
4. Create another file named “job.properties” within the same directory “emailAppDef”. This file
holds the properties of the workflow. The contents of the job.properties file are :
nameNode=hdfs://<namenode IP Address>:8020
jobTracker= <namenode IP Address>:8021
queueName=default
oozie.wf.application.path=/user/jinith.joseph/OozieWFConfigs/emailAppDef
oozie.use.system.libpath=true
5. Create a new folder “lib” within “emailAppDef”, which will hold the main class jar file and the dependent
3rd party jars.
6. Add EmailJava.jar, mail.jar and activation.jar to the “lib” folder.
7. We have now finished defining a workflow that invokes a Java job from Oozie.
Oozie workflow- Implementation
Target : Create a coprocessor that records a history of all insertions into an HBase table and
triggers an Oozie workflow from Java.
Steps : Create a simple Java project referencing the Hadoop, logging, HBase, Common and
Oozie Client jars. Create a Java class as below, overriding the start() and
postPut() functions, which are executed when the coprocessor is loaded for a region and
after a record is added to the HBase table, respectively.
HBase Coprocessor - Implementation
package com; // matches the class name com.incCoProc used later when attaching the coprocessor

// Requires the HBase, Hadoop, commons-logging and Oozie client jars on the classpath.
import java.io.IOException;
import java.util.Properties;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.oozie.client.OozieClient;
import sun.misc.BASE64Encoder; // JDK-internal API

public class incCoProc extends BaseRegionObserver {
private byte[] master;
private byte[] family;
private byte[] flags = Bytes.toBytes("flags");
private Log log = LogFactory.getLog(incCoProc.class);
@Override
public void start(CoprocessorEnvironment e) throws IOException {
Configuration conf = e.getConfiguration();
master = Bytes.toBytes(conf.get("master"));
family = Bytes.toBytes(conf.get("family"));
}
The start function is executed when
the coprocessor is associated with an HBase
table ( or with all of them ). You can provide
parameters to this function.
Here, the start function accepts a couple of
arguments on initialization. I have used two
arguments, picked from
the configuration, which denote a table
name and a family name for further
operations.
HBase Coprocessor - Implementation
@Override
public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) {
try
{
final RegionCoprocessorEnvironment env = e.getEnvironment();
final byte[] row = put.getRow();
Get get = new Get(row);
final Result result = env.getRegion().get(get);
new Thread(new Runnable() {
public void run() {
try {
insert(env, row, result);
} catch (IOException e1) {
e1.printStackTrace();
}
}
}).start();
}
catch(IOException ex)
{ log.error("postPut failed", ex); } // logged and swallowed: an exception escaping a coprocessor hook can abort the region server
}
postPut(), which must be overridden, is
called after any insertion into the
associated HBase tables.
Here we obtain the environment details
and the row details of the new record in the
HBase table, and call a function on a
separate thread to minimize the insertion time
on the parent HBase table.
* postPut() is executed after every insert, and
because the hook runs synchronously, the
insertion time on the HBase table grows linearly
with the time required to complete the actions inside it.
HBase Coprocessor - Implementation
private void insert(RegionCoprocessorEnvironment env, byte[] row, Result r) throws IOException {
Table masterTable = env.getTable(TableName.valueOf(master));
try {
CellScanner scanner = r.cellScanner(); StringBuilder str = new StringBuilder();
int count = 0;
while (scanner.advance()) {
Cell cell = scanner.current();
byte[] qualifier = CellUtil.cloneQualifier(cell);
byte[] value = CellUtil.cloneValue(cell);
if (count > 0)
str.append("|");
str.append(Bytes.toString(qualifier)).append("=").append(Bytes.toString(value));
count++;
}
Put put = new Put(row); put.addColumn(family, row, Bytes.toBytes(str.toString()));
put.addColumn(flags,Bytes.toBytes("EmailSend"),Bytes.toBytes("false"));
put.addColumn(flags, Bytes.toBytes("Archived"), Bytes.toBytes("false"));
put.addColumn(flags, Bytes.toBytes("Flattened"), Bytes.toBytes("false"));
masterTable.put(put);
BASE64Encoder encoder = new BASE64Encoder();
final String pdfContent = encoder.encodeBuffer(Bytes.toBytes(str.toString()));
OozieJobInvoke(pdfContent);
}
catch(Exception ex)
{ log.error("insert failed", ex); } // logged and swallowed so a failure here cannot abort the region server
finally {
masterTable.close();
}
}
In this insert function, we read the
data of the newly inserted row from the
parent HBase table and insert the same data,
with some extra fields ( flags ), into a log
table ( master ).
For every Put on table A ( with a row key
"<row id>" ), a cell is put into the master
table with row key "<row id>", qualifier
"<row id>" and, as its value, a concatenation of
the qualifiers and values of the row in A.
The master table name and column family are
passed as arguments.
Finally, we initiate an Oozie job call with
the content of the data ( pdfContent ).
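The value-building step above ( concatenating qualifier=value pairs with "|" and Base64-encoding the result ) can be sketched in isolation. This is an illustrative sketch with made-up cell values; java.util.Base64 stands in for the JDK-internal BASE64Encoder used in the listing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

public class RowFlattenSketch {
    // Joins qualifier=value pairs with "|", mirroring the loop in insert().
    static String flatten(Map<String, String> cells) {
        StringBuilder str = new StringBuilder();
        int count = 0;
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            if (count > 0) str.append("|");
            str.append(cell.getKey()).append("=").append(cell.getValue());
            count++;
        }
        return str.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("name", "Alice");   // hypothetical qualifiers and values
        cells.put("total", "42.50");
        String flat = flatten(cells);
        System.out.println(flat);
        // Encode for transport as the pdfContent workflow argument.
        String encoded = Base64.getEncoder()
                .encodeToString(flat.getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded);
    }
}
```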
HBase Coprocessor - Implementation
private void OozieJobInvoke(String pdfContent) {
OozieClient wc = new OozieClient("http://hdfs:hdfs@<name node>:11000/oozie");
Properties conf = wc.createConfiguration();
conf.setProperty("nameNode", "hdfs://<name node>:8020");
conf.setProperty("jobTracker", "<name node>:8032");
conf.setProperty("queueName", "default");
conf.setProperty("oozie.libpath", "${nameNode}/user/oozie/OozieWFConfigs/emailAppDef/lib");
conf.setProperty("oozie.use.system.libpath", "true");
conf.setProperty("oozie.wf.rerun.failnodes", "true");
conf.setProperty("oozieProjectRoot",
"${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
conf.setProperty("appPath",
"${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
conf.setProperty(OozieClient.APP_PATH, "${appPath}/workflow.xml");
conf.setProperty("barcode", "0MN20151228102N21452");
conf.setProperty("pdfContent", pdfContent);
try {
String jobId = wc.run(conf);
} catch (Exception r) {
r.printStackTrace(); // surface submission failures instead of silently swallowing them
}
}
In this function, where the Oozie job is called,
we specify various configuration properties
for the Oozie job to pick up, then set
the project path and the application path
( which is the HDFS path of the Oozie
workflow XML we created and
uploaded ).
Once OozieClient.run() is called, the
Oozie job is submitted. The
oozie.libpath property denotes the jar files that
have to be included in the Oozie workflow.
The Oozie workflow has a couple of arguments
that can be set using the conf.setProperty()
function ( barcode, pdfContent ).
As we have already defined the Oozie
workflow to execute a Java job, the Java job
is called and the email is sent.
HBase Coprocessor – Assign to an HBase
table
Target : Associate the HBase coprocessor with an HBase table
Steps :
1. Upload the coprocessor jar file to an HDFS location.
2. Start the hbase shell and create the required tables.
3. Disable the table with which the coprocessor is being associated.
4. Use the command below to associate the HBase coprocessor with the HBase table :
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att', 'coprocessor' =>
'hdfs:///user/oozie/JARS/incCoProc.jar|com.incCoProc||master=log,family=family'
While the coprocessor is being associated with the HBase table, you will see the regions being updated in the shell.
5. Enable the table to which the coprocessor was attached.
6. If you run describe '<HBase table name>' from the hbase shell, you should see that the coprocessor is attached to the HBase table.
7. Try inserting new records into the HBase table to which the coprocessor was attached; you should see that a history is maintained
in the “log” table and emails are sent out to the configured user.
master and family are
arguments passed to
the HBase coprocessor,
where the overridden start()
function uses them for further
operations.
This way, every insertion into
the data table is also
inserted into the “log” table
under the column family “family”.
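The coprocessor attribute value in step 4 follows the pattern path|class|priority|key=value,... . A small sketch that assembles it, just to make the four fields explicit ( the jar path and class name are the ones from the alter command above; the priority field is left empty, as there ):

```java
public class CoprocSpecSketch {
    // Builds the table_att 'coprocessor' attribute value: path | class | priority | args.
    static String coprocessorSpec(String jarPath, String className,
                                  String priority, String args) {
        return jarPath + "|" + className + "|" + priority + "|" + args;
    }

    public static void main(String[] args) {
        System.out.println(coprocessorSpec(
                "hdfs:///user/oozie/JARS/incCoProc.jar",
                "com.incCoProc",
                "",                          // empty priority, as in the shell command
                "master=log,family=family")); // args read by start() via the configuration
    }
}
```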
HBase Coprocessor – Unset from an HBase
table
Target : Unset the HBase coprocessor from an HBase table
Steps :
1. Start the hbase shell.
2. Disable the table to which the coprocessor is attached.
3. Use the command below to unset the HBase coprocessor from the HBase table :
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
4. Enable the table.
5. Try inserting new records into the table; no actions will be triggered.
Thank You!
I hope this gave you an introduction to HBase coprocessors and Oozie jobs and how they can be
combined for various functionalities.

More Related Content

PPTX
Sherlock Homepage - A detective story about running large web services - WebN...
PPTX
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
PPTX
DNS for Developers - NDC Oslo 2016
PDF
Working with Hive Analytics
PPTX
Everything you wanted to know, but were afraid to ask about Oozie
PDF
High Performance Hibernate JavaZone 2016
PPTX
Hadoop Oozie
PPTX
July 2012 HUG: Overview of Oozie Qualification Process
Sherlock Homepage - A detective story about running large web services - WebN...
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
DNS for Developers - NDC Oslo 2016
Working with Hive Analytics
Everything you wanted to know, but were afraid to ask about Oozie
High Performance Hibernate JavaZone 2016
Hadoop Oozie
July 2012 HUG: Overview of Oozie Qualification Process

What's hot (20)

PDF
Consuming RESTful services in PHP
PDF
Oozie HUG May12
PPTX
Clogeny Hadoop ecosystem - an overview
ODP
An Overview of Node.js
PDF
Apache spark with akka couchbase code by bhawani
PPTX
BD-zero lecture.pptx
PDF
Enable Database Service over HTTP or IBM WebSphere MQ in 15_minutes with IAS
PDF
High-Performance Hibernate - JDK.io 2018
PPTX
Ex-8-hive.pptx
PPTX
Apache Oozie
PDF
October 2013 HUG: Oozie 4.x
PDF
Event Processing and Integration with IAS Data Processors
PDF
Mule caching strategy with redis cache
PDF
AD102 - Break out of the Box
PDF
Solr Application Development Tutorial
PPTX
Break out of The Box - Part 2
PPTX
Oozie or Easy: Managing Hadoop Workloads the EASY Way
PPTX
IBM Connect 2016 - Break out of the Box
PDF
COScheduler
PDF
Wizard of ORDS
Consuming RESTful services in PHP
Oozie HUG May12
Clogeny Hadoop ecosystem - an overview
An Overview of Node.js
Apache spark with akka couchbase code by bhawani
BD-zero lecture.pptx
Enable Database Service over HTTP or IBM WebSphere MQ in 15_minutes with IAS
High-Performance Hibernate - JDK.io 2018
Ex-8-hive.pptx
Apache Oozie
October 2013 HUG: Oozie 4.x
Event Processing and Integration with IAS Data Processors
Mule caching strategy with redis cache
AD102 - Break out of the Box
Solr Application Development Tutorial
Break out of The Box - Part 2
Oozie or Easy: Managing Hadoop Workloads the EASY Way
IBM Connect 2016 - Break out of the Box
COScheduler
Wizard of ORDS
Ad

Similar to Hbase coprocessor with Oozie WF referencing 3rd Party jars (20)

PPTX
Apache Oozie
DOCX
Node js getting started
PPT
nodejs_at_a_glance.ppt
PPTX
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
PPT
nodejs_at_a_glance, understanding java script
PPTX
PPTX
Unit 5-apache hive
PPTX
Splice Machine Overview
PPTX
Unit 5
PPTX
Academy PRO: HTML5 Data storage
PPTX
מיכאל
PPTX
Debugging Hive with Hadoop-in-the-Cloud
PPTX
De-Bugging Hive with Hadoop-in-the-Cloud
PPTX
My Saminar On Php
PDF
Yahoo! Hack Europe Workshop
PPTX
Etl with talend (big data)
PDF
Basic API Creation with Node.JS
PPTX
Apache Oozie
Node js getting started
nodejs_at_a_glance.ppt
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
nodejs_at_a_glance, understanding java script
Unit 5-apache hive
Splice Machine Overview
Unit 5
Academy PRO: HTML5 Data storage
מיכאל
Debugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
My Saminar On Php
Yahoo! Hack Europe Workshop
Etl with talend (big data)
Basic API Creation with Node.JS
Ad

Recently uploaded (20)

PDF
WOOl fibre morphology and structure.pdf for textiles
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PPTX
Tartificialntelligence_presentation.pptx
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
Enhancing emotion recognition model for a student engagement use case through...
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PPT
What is a Computer? Input Devices /output devices
PDF
Zenith AI: Advanced Artificial Intelligence
PDF
Taming the Chaos: How to Turn Unstructured Data into Decisions
PPTX
observCloud-Native Containerability and monitoring.pptx
PDF
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
PDF
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
PDF
A Late Bloomer's Guide to GenAI: Ethics, Bias, and Effective Prompting - Boha...
PPTX
O2C Customer Invoices to Receipt V15A.pptx
PDF
Assigned Numbers - 2025 - Bluetooth® Document
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PDF
STKI Israel Market Study 2025 version august
PPTX
Chapter 5: Probability Theory and Statistics
PDF
Unlock new opportunities with location data.pdf
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
WOOl fibre morphology and structure.pdf for textiles
sustainability-14-14877-v2.pddhzftheheeeee
Tartificialntelligence_presentation.pptx
1 - Historical Antecedents, Social Consideration.pdf
Enhancing emotion recognition model for a student engagement use case through...
NewMind AI Weekly Chronicles – August ’25 Week III
What is a Computer? Input Devices /output devices
Zenith AI: Advanced Artificial Intelligence
Taming the Chaos: How to Turn Unstructured Data into Decisions
observCloud-Native Containerability and monitoring.pptx
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
A Late Bloomer's Guide to GenAI: Ethics, Bias, and Effective Prompting - Boha...
O2C Customer Invoices to Receipt V15A.pptx
Assigned Numbers - 2025 - Bluetooth® Document
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
STKI Israel Market Study 2025 version august
Chapter 5: Probability Theory and Statistics
Unlock new opportunities with location data.pdf
Final SEM Unit 1 for mit wpu at pune .pptx

Hbase coprocessor with Oozie WF referencing 3rd Party jars

  • 1. Email notifications from HBase data Hadoop : HBase Coprocessors and Oozie jobs Jinith Joseph
  • 2. • Introduction to Hbase co processors and Oozie java jobs • Notifications from HBase using HBase co processors & Oozie jobs • Implementation on Cloudera cdh 5.4.0 without flume Requirement
  • 3. Scope of Coprocessors • System Level : Coprocessors can be configured to work on every tables and regions • Table Level : Coprocessors can be configured to work on all regions of a specific table Types of Coprocessors • Observer coprocessor ( Acts like triggers in conventional databases ) This allows users to insert custom code by overriding the upcall methods provided by the coprocessor framework. The callback functions are executed from core HBase code when certain events occur. • Endpoint coprocessor ( Resembled stored procedures in conventional databases ) User can invoke the end point at any time from the client, which will be executed remotely at the target or all regions and the results will be returned to the client. Overview of HBase Coprocessor
  • 4. Observer Coprocessor Three types of Observers • Region Observer These observers provide hooks to data manipulation events like get , put, delete, scan , etc on HBase tables. For every table region, there will be an instance of RegionObserver coprocessor. But the scope of the observers could be set to a specific region or across all regions. • WAL Observer Provides hooks for write-ahead log (WAL) related operations. These runs in the context of WAL processing. This is used for WAL writing and reconstruction events (Eg: Bulk load of HBase tables using HFile). • Master Observer These observes provides hook on the DDL operations like create / alter / delete tables. The master observer runs within the context of HBase master.
  • 5. Oozie Apache Oozie is a system for running workflows of dependent jobs, and contains two main engines : • workflow engine – This stores and runs workflows composed of different types of Hadoop jobs • coordinator engine - This runs workflow jobs based on predefined schedules and data availability. It allows the user to define and execute recurrent and interdependent workflow jobs The Oozie web UI will be available in the URL : http://namdenode:11000 Using Oozie workflow different tasks can be setup up as a part of workflow including : • Hive Script, Pig , Spark, Java, Sqoop, MapReduce, Shell, Ssh, HDFS Fs, Email, Streaming, Distcp, etc. Oozie workflows generally follows a transition Diagram to report its status as : Start Actions End Fail Error Success
  • 6. How to Connect these? HBase Data TableData insertion to HBase Coprocessor Region Observer Coprocessor postput() which will be invoked after the insertion of the data in HBase table has happened HBase Log Table Oozie job Invoke Oozie job with parameters on a different thread Using 3rd Party .jars – mail.jar and activation.jar Frame the email content from parameters Using SMTP send out the email to users. * Might not be an ideal way to run in production environment. Please consider the rate of data and the threads invoked to run Oozie jobs JAVA JAVA
  • 7. Region Observer Coprocessor – postput() • The client code will be written in Java which would override the upcall methods of coprocessor framework and implement Observer coprocessor, which will be initiated as soon as there is a data available in the HBase table. • The insertion of the data will consume the same time before any coprocessors are finished. So any bulk code on the Coprocessor (overriding put events), will cost us on inserting data in the HBase tables. So here we would use have our functionality ( Sending out emails ) on an oozie job which will run on a separate thread and will be triggered from the HBase observer coprocessor Oozie workflow – Java job using 3rd Party jars • The Oozie workflow will contain a java job to send out emails using SMTP, will be invoked using Oozie Java API • Workflow will have a dedicated folder in HDFS with below structure and files ( Each files described later.. ) - /user/oozie/OozieWFConfigs/emailAppDef - job.properties ( file to hold the specific properties of the workflow ) - workflow.xml ( the workflow design ) - /lib - <Our Jar File>.jar, activation.jar , mail.jar How we use HBase coprocessor & Oozie
  • 8. Target : To write a simple java program with main class, to send out emails to users using SMTP. Steps : 1. Create a Simple Java project, with package org. 2. Create a class EmailJava.java as below : EmailJava.jar Implementation package org; public class Emails { private static byte[] Data; public static void main(String [] args) { try { String barcode = args[0]; String pdfContent = args[1]; BASE64Decoder decoder = new BASE64Decoder(); Data = decoder.decodeBuffer(pdfContent); Properties props = System.getProperties(); props.put("mail.smtp.host","mail.company.com"); props.put("mail.smtp.auth","true"); Session session = Session.getInstance(props, null); Message msg = new MimeMessage(session); msg.setFrom(new InternetAddress("oozie@company.com")); msg.setRecipients(Message.RecipientType.TO,InternetAddress.parse("<user email address>", false)); msg.setSubject("Your Purchase Receipt "+ barcode ); msg.setHeader("X-Mailer", "Company CRM Communications"); String content = "Thank you for shopping. Please find the attached receipt for your purchase."; MimeBodyPart textBodyPart = new MimeBodyPart(); textBodyPart.setText(content);
  • 9. 3. Compile the code with mail.jar and activation.jar included in the referenced libraries to create EmailJava.jar EmailJava.jar Implementation MimeBodyPart pdfBodyPart = new MimeBodyPart(); pdfBodyPart.setDataHandler(new DataHandler(new ByteArrayDataSource(Data, "application/pdf"))); pdfBodyPart.setFileName("Receipt_" + barcode + ".pdf"); MimeMultipart mimeMultipart = new MimeMultipart(); mimeMultipart.addBodyPart(textBodyPart); mimeMultipart.addBodyPart(pdfBodyPart); msg.setContent(mimeMultipart); msg.setSentDate(new Date()); SMTPTransport t = (SMTPTransport)session.getTransport("smtp"); System.out.println("Attempting to connect to the SMTP Server"); t.connect("mail.company.com", "<SMTP_UserName>", "<SMTP_Password>"); System.out.println("Connection succeeded. Attempting to send email."); t.sendMessage(msg, msg.getAllRecipients()); System.out.println("Response: " + t.getLastServerResponse()); t.close(); } catch(Exception ex) { ex.printStackTrace(); } } }
  • 10. Target : As the oozie workflow have to be triggered from a coprocessor, we will first implement an oozie workflow to send out emails to users. We define the workflow using XML and will test it within Cloudera environment. Steps : 1. In HDFS file browser create a new folder “emailAppDef” ( say in the path : “/user/oozie/OozieWFConfigs/emailAppDef” ) 2. Within the folder create a file “workflow.xml”. Edit the file to have contents like below : Oozie workflow- Implementation <workflow-app name="Email" xmlns="uri:oozie:workflow:0.5"> <start to="java-95a1"/> <kill name="Kill"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <action name="java-95a1"> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <main-class>org.Emails</main-class> <arg>${barcode}</arg> <arg>${pdfContent}</arg> <file>/user/oozie/OozieWFConfigs/emailAppDef/lib/mail.jar#mail.jar</file> <file>/user/oozie/OozieWFConfigs/emailAppDef/lib/activation.jar#activation.jar</file> <file>/user/oozie/OozieWFConfigs/emailAppDef/lib/EmailJava.jar#EmailJava.jar</file> </java> <ok to="End"/> <error to="Kill"/> </action> <end name="End"/> </workflow-app>
  • 11. Steps : 3. With the XML in the last step, we have mentioned that main class is “org.Emails”, and then we have included 3 files – mail.jar, activation.jar and EmailJava.jar , where EmailJava.jar is the program we have written and contains the main class “org.Emails” 4. Create another file within the same directory “emailAppDef” with name “job.properties”. This file will hold the properties of the workflow. The contents of the job.properties file will be as : nameNode=hdfs://<namenode IP Address>:8020 jobTracker= <namenode IP Address>:8021 queueName=default oozie.wf.application.path=/user/jinith.joseph/OozieWFConfigs/emailAppDef oozie.use.system.libpath=true 5. Create a new folder “lib” within “emailAppDef” , which will hold the main class jar file and the dependent 3rd party jars. 6. Add the EmaiJava.jar, mail.jar and activation.jar into the “lib” folder. 7. We have now finished defining the workflow to invoke a java job from oozie. Oozie workflow- Implementation
  • 12. Target : To create a coprocessor to create a history of all insertions in a HBase table and trigger an oozie workflow in Java. Steps : Create a simple java project referencing Hadoop, logging, HBase, Common and Oozie Client jars. Create a java class as below which will override the start() and postPut() function, which will be executed when there is any activity in the region and post adding a record in the HBase table respectively. HBase Coprocessor - Implementation public class incCoProc extends BaseRegionObserver { private byte[] master; private byte[] family; private byte[] flags = Bytes.toBytes("flags"); private Log log; @Override public void start(CoprocessorEnvironment e) throws IOException { Configuration conf = e.getConfiguration(); master = Bytes.toBytes(conf.get("master")); family = Bytes.toBytes(conf.get("family")); } The start function will be executed when the coprocessor is associated to a HBase table or all of them. You could provide parameters to this function. Here, the start function accepts couple of arguments to be initiated. I have used two arguments here , which are picked from the configuration, which denotes a table name and a family name for operations further.
  • 13. HBase Coprocessor - Implementation @Override public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) { try { final RegionCoprocessorEnvironment env = e.getEnvironment(); final byte[] row = put.getRow(); Get get = new Get(row); final Result result = env.getRegion().get(get); new Thread(new Runnable() { public void run() { try { insert(env, row, result); } catch (IOException e1) { e1.printStackTrace(); } } }).start(); } catch(IOException ex) { } } postPut() which should be overridden, will be called after there is any insertion on the associated Hbase tables. Here we have got the environment details, and the row details of the new record in the HBase table. And called a function on a separate thread to minimize insertion time to the parent HBase table. * After each insert the postPut() will be executed and the time of insertion to the HBase table is linearly related to the time required to complete the actions.
• 14. HBase Coprocessor - Implementation

private void insert(RegionCoprocessorEnvironment env, byte[] row, Result r) throws IOException {
    Table masterTable = env.getTable(TableName.valueOf(master));
    try {
        CellScanner scanner = r.cellScanner();
        StringBuilder str = new StringBuilder();
        int count = 0;
        while (scanner.advance()) {
            Cell cell = scanner.current();
            byte[] qualifier = CellUtil.cloneQualifier(cell);
            byte[] value = CellUtil.cloneValue(cell);
            if (count > 0) str.append("|");
            str.append(Bytes.toString(qualifier)).append("=").append(Bytes.toString(value));
            count++;
        }
        Put put = new Put(row);
        put.addColumn(family, row, Bytes.toBytes(str.toString()));
        put.addColumn(flags, Bytes.toBytes("EmailSend"), Bytes.toBytes("false"));
        put.addColumn(flags, Bytes.toBytes("Archived"), Bytes.toBytes("false"));
        put.addColumn(flags, Bytes.toBytes("Flattened"), Bytes.toBytes("false"));
        masterTable.put(put);
        BASE64Encoder encoder = new BASE64Encoder();
        final String pdfContent = encoder.encodeBuffer(Bytes.toBytes(str.toString()));
        OozieJobInvoke(pdfContent);
    } catch (Exception ex) {
        // swallow
    } finally {
        masterTable.close();
    }
}

In this insert function we read the data of the row newly inserted into the parent HBase table and insert the same data, plus some extra fields (flags), into a log table (master). For every Put on table A (with row key "<rowid>"), a cell is put into the master table with the same row key, qualifier "<rowid>", and as value a concatenation of the qualifiers and values of the row in A. The master table name and column family are passed as arguments. Finally, an Oozie job is invoked with the Base64-encoded content of the data (pdfContent).
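The CellScanner loop above reduces to joining qualifier=value pairs with "|". A minimal sketch of that flattening with plain Java collections (RowFlattener and the sample qualifiers are illustrative; no HBase dependency):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowFlattener {
    // Same logic as the CellScanner loop: concatenate qualifier=value
    // pairs, separated by '|', preserving the order of the cells.
    public static String flatten(Map<String, String> cells) {
        StringBuilder str = new StringBuilder();
        int count = 0;
        for (Map.Entry<String, String> e : cells.entrySet()) {
            if (count > 0) str.append("|");
            str.append(e.getKey()).append("=").append(e.getValue());
            count++;
        }
        return str.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("name", "jinith");
        cells.put("city", "kochi");
        System.out.println(flatten(cells));  // name=jinith|city=kochi
    }
}
```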
• 15. HBase Coprocessor - Implementation

private void OozieJobInvoke(String pdfContent) {
    OozieClient wc = new OozieClient("http://hdfs:hdfs@<name node>:11000/oozie");
    Properties conf = wc.createConfiguration();
    conf.setProperty("nameNode", "hdfs://<name node>:8020");
    conf.setProperty("jobTracker", "<name node>:8032");
    conf.setProperty("queueName", "default");
    conf.setProperty("oozie.libpath", "${nameNode}/user/oozie/OozieWFConfigs/emailAppDef/lib");
    conf.setProperty("oozie.use.system.libpath", "true");
    conf.setProperty("oozie.wf.rerun.failnodes", "true");
    conf.setProperty("oozieProjectRoot", "${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
    conf.setProperty("appPath", "${nameNode}/user/jinith.joseph/OozieWFConfigs/emailAppDef");
    conf.setProperty(OozieClient.APP_PATH, "${appPath}/workflow.xml");
    conf.setProperty("barcode", "0MN20151228102N21452");
    conf.setProperty("pdfContent", pdfContent);
    try {
        String jobId = wc.run(conf);
    } catch (Exception r) {
        // swallow
    }
}

In this function, where the Oozie job is invoked, we set the various configuration properties the Oozie job will pick up, then the project path and the application path (the HDFS path of the Oozie workflow XML we created and uploaded). Once OozieClient.run() is called, the Oozie job is submitted. The oozie.libpath property points to the jar files that have to be included in the Oozie workflow. The workflow takes a couple of arguments, which are set via conf.setProperty() (barcode, pdfContent). Since we have already defined the Oozie workflow to execute a Java action, that Java job runs and the email is sent.
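Note that the BASE64Encoder used on slide 14 is the internal sun.misc class, which is not part of the public JDK API and was removed in Java 9. A portable equivalent using java.util.Base64 (public API since Java 8); the class name PdfContentEncoder is illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PdfContentEncoder {
    // Encode the flattened row content for transport as an Oozie
    // workflow property, replacing sun.misc.BASE64Encoder.
    public static String encode(String content) {
        return Base64.getEncoder()
                     .encodeToString(content.getBytes(StandardCharsets.UTF_8));
    }

    // The workflow's Java action would decode it back on the other side.
    public static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("name=jinith|city=kochi");
        System.out.println(encoded);
        System.out.println(decode(encoded));  // name=jinith|city=kochi
    }
}
```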
• 16. HBase Coprocessor – Assign to an HBase table
Target : Associate the HBase coprocessor with an HBase table
Steps :
1. Upload the coprocessor jar file to an HDFS location.
2. Start hbase shell and create the required tables.
3. Disable the table to which the coprocessor is being attached.
4. Use the command below to associate the HBase coprocessor with the HBase table:
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att', 'coprocessor' => 'hdfs:///user/oozie/JARS/incCoProc.jar|com.incCoProc||master=log,family=family'
While the coprocessor is being associated with the HBase table, you will see the regions being updated on the shell.
5. Enable the table to which the coprocessor was attached.
6. If you run describe '<HBase table name>' from hbase shell, you should see that the coprocessor is attached to the HBase table.
7. Try inserting new records into the HBase table to which the coprocessor was attached; you should see a history maintained in the "log" table and emails sent out to the configured user.
master and family are arguments passed to the HBase coprocessor; the overridden start() function uses them for the later operations, so that every insertion on the observed table is recorded in the "log" table under the column family "family".
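The steps above can be sketched as a single hbase shell session; the table names 'mytable', 'log' and the column families are illustrative stand-ins for your own:

```
hbase(main):001:0> create 'mytable', 'cf'
hbase(main):002:0> create 'log', 'family', 'flags'
hbase(main):003:0> disable 'mytable'
hbase(main):004:0> alter 'mytable', METHOD => 'table_att', 'coprocessor' => 'hdfs:///user/oozie/JARS/incCoProc.jar|com.incCoProc||master=log,family=family'
hbase(main):005:0> enable 'mytable'
hbase(main):006:0> describe 'mytable'
```

The attach string has four pipe-separated fields: the jar path in HDFS, the fully qualified coprocessor class, an optional priority (empty here, so the default is used), and the key=value arguments read by start().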
• 17. HBase Coprocessor – Unset from an HBase table
Target : Unset the HBase coprocessor from an HBase table
Steps :
1. Start hbase shell.
2. Disable the table to which the coprocessor is attached.
3. Use the command below to remove the HBase coprocessor from the HBase table:
hbase(main):049:0> alter '<HBase table name>', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
4. Enable the table.
5. Try inserting new records into the table; no actions will be triggered anymore.
• 18. Thank You! I hope this has given you an introduction to HBase coprocessors and Oozie jobs, and to how they can be combined for various functionalities.