Automating Hadoop Jobs Using Rundeck

kiran • April 6, 2017
Rundeck is open source software for automating routine operational
procedures. It is developed on GitHub as a project called rundeck by
SimplifyOps and the Rundeck community.
Following are some of its features:

Web API
Distributed command execution
Pluggable execution system (SSH by default)
Multistep workflows
Job execution with on-demand or scheduled runs
Graphical web console for command and job execution
Role-based access control policy with support for LDAP/ActiveDirectory
History and auditing logs
Open integration with external host inventory tools
Command line interface tools
In our previous blog, we showed how to schedule a Hadoop job in Rundeck. In
this blog, we will demonstrate how to automate a Hadoop/Hive/Pig job using
Rundeck, so that the job runs on its own on a daily or even a monthly basis.
We recommend that you go through our previous blogs on Rundeck for the
installation steps and for how to schedule a Hadoop job.
Let us start with project creation. We will put a list of Hive queries in a
file and then configure a job that runs it automatically every day.
To create a new project, click on New Project and provide the necessary
details, such as the project name and description, as shown in the screenshot
below.
Also, check the option Require File Exists in the Resource Model Source and
click on Save.

Now, scroll to the end of the page and click on Create. Your Rundeck project
will be created and you will see the project screen as shown below.
Now click on Create Job at the right corner and then on New Job. Fill in the
necessary details, such as the job name and description, as shown in the
screenshot below:

Next, go to the Workflow section and select the options of your choice. We
have selected the following options:
If a step fails: Stop at the failed step
Strategy: Sequential
To provide a job or a query, go to the Add a Step section and select the
option Command.
Here, you need to provide the command that runs the Hive query file, which
contains a set of Hive queries. Our Hive queries are shown below.
We receive our employee details in the file emp.csv in HDFS on a daily
basis. So, we are creating an HQL file with the following content. We have
named it hive_query.hql.
-- Create the database and table only if they are not already present
CREATE DATABASE IF NOT EXISTS employee;
USE employee;

CREATE EXTERNAL TABLE IF NOT EXISTS employee_test (
id STRING,
first_name STRING,
last_name STRING,
email STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Load the day's file from HDFS into the table
LOAD DATA INPATH '/emp.csv' INTO TABLE employee_test;
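Because LOAD DATA INPATH moves the file from its HDFS location into the
table's directory, /emp.csv has to arrive afresh before each run. The sketch
below shows one way a step placed ahead of the Hive step could check for the
file, so that the "Stop at the failed step" option halts the workflow when it
is missing. The names HDFS_BIN and require_hdfs_file are hypothetical and are
not part of Rundeck or Hadoop; only the hdfs dfs -test -e call is a standard
HDFS shell command.

```shell
#!/bin/sh
# Hypothetical helper: HDFS_BIN and require_hdfs_file are illustrative names.
# HDFS_BIN can be overridden, for example to point at a full path to hdfs.
HDFS_BIN="${HDFS_BIN:-hdfs}"

# Succeeds (exit code 0) only when the given HDFS path exists.
require_hdfs_file() {
    if "$HDFS_BIN" dfs -test -e "$1"; then
        echo "found $1"
    else
        echo "missing $1" >&2
        return 1
    fi
}
```

As a Rundeck Command step this could be called as require_hdfs_file /emp.csv;
a non-zero exit code marks the step as failed.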
The command used to run this script in the command line is shown below:
hive -f hive_query.hql
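If the HQL file is not in the directory where Rundeck runs the command, the
step fails with an error from Hive. A minimal wrapper sketch, assuming you
prefer a clearer message, is shown here. HIVE_BIN and run_hql are
hypothetical names introduced for illustration; the only Hive feature used is
the -f option from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper: HIVE_BIN and run_hql are illustrative names.
# HIVE_BIN can be overridden, for example with the full path to hive.
HIVE_BIN="${HIVE_BIN:-hive}"

# Runs an HQL file with "hive -f" after checking that the file exists.
# Returns non-zero when the file is missing or when hive itself fails.
run_hql() {
    hql_file="$1"
    if [ ! -f "$hql_file" ]; then
        echo "HQL file not found: $hql_file" >&2
        return 1
    fi
    "$HIVE_BIN" -f "$hql_file"
}
```

The step would then call run_hql with the path to hive_query.hql, preferably
an absolute path so that the result does not depend on the working directory.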

After entering the command, click on Save. If you want to run another query
after this one, you can do so by adding another step.
After loading the data, we want to count the number of employees present. We
can do this by using the following Hive query:

SELECT COUNT(*) FROM employee.employee_test;
We will save this query in a file with the name hive_emp.hql. At the end of
the command that runs it, we have added > emp_cnt.txt, so the shell
redirects the query output into the file emp_cnt.txt. We will enter this
command as the next step in our workflow as shown in the screenshot below:
hive -f hive_emp.hql > emp_cnt.txt
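Note that > overwrites emp_cnt.txt on every run, so the file only holds the
latest count. A later step could verify that the file really contains a
number before anything else relies on it. The following is a sketch under the
assumption that the last line of the file is the count; read_count is a
hypothetical helper name, not a Hive or Rundeck feature.

```shell
#!/bin/sh
# Hypothetical helper: read_count is an illustrative name.
# Prints the count stored on the last line of the given file, or fails
# when that line is empty or contains anything other than digits.
read_count() {
    count=$(tail -n 1 "$1" | tr -d '[:space:]')
    case "$count" in
        ''|*[!0-9]*)
            echo "unexpected content in $1" >&2
            return 1
            ;;
    esac
    echo "$count"
}
```

Calling read_count emp_cnt.txt after the Hive step prints the count, and it
returns a non-zero exit code if the query produced no usable output.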
To automate this job, select the following option:
Schedule to run repeatedly: Yes
You can define the schedule in two ways: with the Simple form or with a
Unix-style crontab expression. After selecting the necessary options, scroll
to the bottom and click on Create.
After clicking on Create, you will be redirected to the Job page. Beside your
job, you can see the countdown to its next run.
You can also see your job definition in the Definition tab as shown below:

In this demo, we have changed the schedule time so that only 2 minutes remain
before the job runs. After 2 minutes, the job will run automatically.
You can track the job status in the Activity for this job section below.
Here, we have four tabs: running, recent, failed, and by you.
After 2 minutes, you can see in the running tab that your job is running.
Once your job starts, it is assigned a deployment (execution) number, which
you can use to track the job's running status and its complete console
output. In the screenshot displayed below, you can see that our job is
running.

The deployment number is 21. In the recent tab, we can see the list of all
the succeeded and failed executions. Now, we will open execution number 21
and look at its console output.
We can see that our job has run successfully. We can check the output in the
Log Output tab. Here, you can see the console output for both steps.

Now, we will check the output in the file emp_cnt.txt.
In the above screenshot, you can see that there are 6000 employees in the
company to date. As scheduled, the same job will run automatically the next
day, and the new count will be saved.
Once the job completes successfully, you can see the countdown to the next
deployment as shown in the screenshot below:
We hope this blog helped you in automating your Hadoop jobs using Rundeck.
Keep visiting our website, www.acadgild.com, for more updates on Big Data
training and other technologies.