© 2018 Bloomberg Finance L.P. All rights reserved.
DataWorks Summit San Jose 2018
June 20, 2018
Artem Ervits – Hortonworks
Clay Baenziger – Bloomberg
Breathing New Life into Apache Oozie
with Apache Ambari Workflow Manager
Poll:
• Who here uses Oozie?
— In production?
With Kerberos?
— Do you use HUE with Oozie?
— How many workflows do you have in production?
1-10? 10-50? 50+?
— How many actions does the largest workflow contain?
1-10? 10-50? 50+?
— Do you use Oozie with (or want to)?
HBase? Spark? Python? Deployment Automation?
• Do you like XML?
— Do you have a favorite editor for Oozie workflows?
Open Source Workflow Managers
• Apache Airflow (Incubating)
• Luigi by Spotify
• Azkaban by LinkedIn
• (And of course) Apache Oozie
Introduction to Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
• Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
• Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time and data availability.
• Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs as well as system-specific jobs out of the box.
• Oozie is a scalable, reliable and extensible system.
- Paraphrased from http://oozie.apache.org
Actions:
• Map/Reduce
• Hive
• Pig
• HDFS
• Java
• Shell
• Spark
• Sub-Workflow
• E-Mail
• Decision
• Fork
• Join
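To make the "DAG of actions" point concrete, here is a minimal sketch in Python that parses a small workflow definition and derives one valid execution order. The workflow itself (node names, paths, and the `uri:oozie:workflow:0.5` schema version) is a made-up illustration, not taken from the talk, and the helper names `transitions` and `execution_order` are hypothetical.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # Python 3.9+

NS = "{uri:oozie:workflow:0.5}"

# Hypothetical two-action workflow; names and paths are illustrative only.
WORKFLOW_XML = """
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="make_dir"/>
  <action name="make_dir">
    <fs><mkdir path="${nameNode}/tmp/demo"/></fs>
    <ok to="touch_file"/>
    <error to="fail"/>
  </action>
  <action name="touch_file">
    <fs><touchz path="${nameNode}/tmp/demo/_READY"/></fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
"""


def transitions(xml_text):
    """Map each workflow node to the set of nodes it can transition to."""
    root = ET.fromstring(xml_text)
    edges = {}
    for node in root:
        tag = node.tag.replace(NS, "")
        name = "start" if tag == "start" else node.get("name")
        targets = set()
        if node.get("to"):  # start and join nodes carry the target directly
            targets.add(node.get("to"))
        for child in node.iter():
            ctag = child.tag.replace(NS, "")
            if ctag in ("ok", "error", "case", "default"):
                targets.add(child.get("to"))
            elif ctag == "path":  # fork branches
                targets.add(child.get("start"))
        edges[name] = targets
    return edges


def execution_order(edges):
    """Return one topological order; raises graphlib.CycleError on a cycle."""
    # graphlib expects node -> predecessors. We pass node -> successors,
    # so the resulting order is reversed to run from start to end.
    return list(reversed(list(TopologicalSorter(edges).static_order())))


if __name__ == "__main__":
    print(execution_order(transitions(WORKFLOW_XML)))
```

Because the `ok` and `error` transitions both count as edges, the `fail` kill node shows up in the graph alongside the happy path; a workflow whose transitions loop back on themselves would raise `CycleError` here.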
Oozie Release Timeline
• 1.x released in 2010. Yahoo! project with two GitHub releases. Added support for workflow jobs.
• 2.x released in 2011. Still with Yahoo! with nine GitHub releases. Added support for coordinator jobs.
• 3.x released in 2013. Project under Apache. Added support for bundle jobs and HBase credentials.
• 4.x released in 2014. Added support for Hive/HCatalog, Spark integration and Oozie server high availability.
• 5.0 released April 2018. Removes support for Hadoop 1, adds support for Hadoop 3, uses a YARN AM instead of the MR launcher, adds new actions and includes code clean-up.
- Adapted from: Apache Oozie by Mohammad Kamrul Islam and Aravind Srinivasan
Oozie Complaints
• Launcher jobs as map tasks
• Dated UI
• Confusing object model & XML – workflows, coordinators, bundles
• Complicated setup
• DAG visualization
• SLA alerting
• Fine grained authorization
• Easy access to log files
Oozie Complaints and Improvements
• Launcher jobs as map tasks – solved by Oozie 5.0.0, OOZIE-1770
• Dated UI – OOZIE-2683, targeted for Oozie 5.X (Hue and Workflow Manager today)
• Confusing object model & XML – jobs API, patch available, targeted for 5.1, OOZIE-2339
• Complicated setup – can deploy with embedded Jetty in Oozie 5.0.0, OOZIE-2666
• DAG visualization – solved by Oozie 5.0.0, OOZIE-2406
• SLA alerting – since Oozie 4.0.0, OOZIE-1294
• Fine grained authorization – targeted for Oozie 5.X, OOZIE-3196
• Easy access to log files – solved by Diagnostic Tool in Oozie 5.0.0, OOZIE-2296
Oozie UI – React Mock-Up - OOZIE-3283
• Workflows
Oozie Launcher – Prior to Release 5.0
• MR launcher job
Oozie Launcher – Release 5.0
• OYA: OOZIE-1770: Create Oozie Application Master for YARN
— Removes MR launcher job
• Design Doc
Oozie Documentation – Before Release 5.0 and After
Documentation redesign
OOZIE-3163: Improve documentation rendering: use fluido skin and better config
Oozie Workflow Visualization – Prior to 5.0 and After
Before: JUNG. After: Graphviz.
OOZIE-2406: Completely rewrite Graph Generator code
Oozie Fluent Job API – Apache Oozie 5.1
OOZIE-2339: Provide an API for writing jobs based on the XSD schemas
Apache Ambari
Ambari Provides:
• Provisioning of a Hadoop Cluster
• Management of a Hadoop Cluster
• Monitoring of a Hadoop Cluster
— A Metrics System for metrics collection
— An Alert Framework
— A dashboard for monitoring the Hadoop cluster
- Paraphrased from http://ambari.apache.org
Ambari Views
• Ambari Views “offer a systematic way to plug-in UI capabilities to surface custom visualization, management and monitoring features in Ambari Web. A "view" is a way of extending Ambari that allows 3rd parties to plug in new resource types along with the APIs, providers and UI to support them. In other words, a view is an application that is deployed into the Ambari container.”
• Key takeaways:
— One does not need an Ambari managed (administrated) cluster
— Third parties can build views packages to run in the Ambari framework too
— Major views available:
(YARN) Capacity Scheduler, (HDFS) Files, HAWQ, Hive, Pig, Storm, Tez, (YARN ATS) Jobs, (Oozie) Workflow Manager
• Alternatives: Cloudera Hue, bespoke applications
Workflow Manager – Motivation
• Oozie workflows are defined in XML – too verbose
— Provide GUI workflow builder and editor
— Reduce possibility of user introduced errors
— Provide browser based workflow manager
• Integration with File Browser
— Includes S3 support
— Can replace existing Oozie web UI
• Oozie is hard-coded to display only 25 actions
— WFM doesn’t have this limit; tested with 300+ action nodes
• Oozie is scalable
— Can scale WFM by standing up multiple Ambari Views servers
Workflow Manager – Workflow Editor Example
Workflow Manager:
• Available as an Ambari View
• Enables visual editing of Oozie workflows
• Integrated with file browser
• Reduces user input errors
• Minimal input required
Workflow Manager – Execution View Example
• Integrated Dashboard with Workflow Manager View
• Manage Oozie jobs
• Drill down to logs
Workflow Manager – Workflow Design Component
Workflow Manager – Workflow Dashboard Component
Good Documentation: HDP 2.6 – Workflow Manager Basics
Art of the Possible
• Scheduling “non-traditional” Hadoop workflows
— Schedule SQL maintenance operations
— Launch SQL Server on Linux in Docker on YARN for tests
— Warming Caches (HBase, LLAP, etc.)
• Administrative Tasks
— Log clean-up
— Clean-up crashed/abandoned Hive temporary data
— HBase management
Workflow Manager Examples with HBase
• Setup Oozie – Server and Workflows
• Data Definition – Tables, ACLs
• Compactions – Operational
HBase – Setup
Oozie needs HBase Configuration:
• Oozie Server Code (to support HBase delegation tokens)
— In libexec (see Server JARs list)
— In oozie-site.xml
<property>
  <name>oozie.credentials.credentialclasses</name>
  <value>hbase=org.apache.oozie.action.hadoop.HbaseCredentials,…</value>
</property>
• Client Workflow Code:
— Add to workflow.xml:
<credentials>
<credential name="myhbase_creds" type="hbase">
[…]
</credential>
</credentials>
— All your normal HBase security settings in the credential section
• Server JARs:
(Copy the following to Oozie’s libexec)
— hbase-common.jar
— hbase-client.jar
— hbase-server.jar
— hbase-protocol.jar
— hbase-hadoop2-compat.jar
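The server-side registration and the workflow-side credential have to agree with each other, and a mismatch usually only surfaces at submission or run time. As a rough sketch (the helper names `registered_types` and `credential_problems` are hypothetical, and the sample workflow is invented for illustration), the following Python checks that every credential type used in a workflow is registered in `oozie.credentials.credentialclasses` and that every `cred` attribute on an action points at a defined credential. It assumes `cred` may hold a comma-separated list of names.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow: the credential is defined as "myhbase_creds",
# but the action refers to "hbase_creds", so the check should flag it.
WORKFLOW_XML = """
<workflow-app name="hbase-demo" xmlns="uri:oozie:workflow:0.5">
  <credentials>
    <credential name="myhbase_creds" type="hbase"/>
  </credentials>
  <start to="HBASE-Shell"/>
  <action name="HBASE-Shell" cred="hbase_creds">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <exec>hbase</exec>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>
"""


def registered_types(credentialclasses):
    """Parse the comma-separated type=class list from oozie-site.xml."""
    items = [item for item in credentialclasses.split(",") if item.strip()]
    pairs = (item.split("=", 1) for item in items)
    return {key.strip(): value.strip() for key, value in pairs}


def local_name(element):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return element.tag.rsplit("}", 1)[-1]


def credential_problems(workflow_xml, types):
    """Return human-readable problems; an empty list means no mismatch found."""
    root = ET.fromstring(workflow_xml)
    defined = {
        el.get("name"): el.get("type")
        for el in root.iter()
        if local_name(el) == "credential"
    }
    problems = []
    for name, cred_type in defined.items():
        if cred_type not in types:
            problems.append(
                f"credential '{name}' uses unregistered type '{cred_type}'"
            )
    for el in root.iter():
        if local_name(el) == "action" and el.get("cred"):
            # Assumption: cred may list several names separated by commas.
            for cred in el.get("cred").split(","):
                if cred.strip() not in defined:
                    problems.append(
                        f"action '{el.get('name')}' references "
                        f"undefined credential '{cred.strip()}'"
                    )
    return problems


if __name__ == "__main__":
    types = registered_types(
        "hbase=org.apache.oozie.action.hadoop.HbaseCredentials"
    )
    for problem in credential_problems(WORKFLOW_XML, types):
        print(problem)
```

A check like this could run in a deployment pipeline before the workflow is uploaded; it does not replace Oozie's own validation.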
HBase – Data Definition
create_my_table.rb:
# Create 'my_table' only if it does not already exist.
tables = list
matches = tables.select { |table| table.eql?('my_table') }
if matches.empty?
  create 'my_table', {NAME => 'my_col'}
end
exit
HBase Shell:
<action name="HBASE-Shell" cred="hbase_creds">
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>hbase</exec>
    <argument>shell</argument>
    <argument>-n</argument>
    <argument>create_my_table.rb</argument>
  </shell>
  <ok to="do_more_things"/>
  <error to="fail"/>
</action>
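The action above passes `create_my_table.rb` as an argument, so the script has to be present in the working directory where the shell action runs. One common way to arrange that is a `<file>` element in the shell action, which ships a file from the workflow application directory. The sketch below builds the same action programmatically in Python with that element added; the function name `hbase_shell_action` is hypothetical, and the `<file>` line is an assumption about how the script gets deployed, not something shown on the slide.

```python
import xml.etree.ElementTree as ET

SHELL_NS = "uri:oozie:shell-action:0.1"


def hbase_shell_action(name, cred, script, ok_to, error_to):
    """Build an Oozie shell action that runs `hbase shell -n <script>`."""
    action = ET.Element("action", {"name": name, "cred": cred})
    # The xmlns is written as a plain attribute so the output matches
    # the hand-written XML in the slide.
    shell = ET.SubElement(action, "shell", {"xmlns": SHELL_NS})
    ET.SubElement(shell, "job-tracker").text = "${jobTracker}"
    ET.SubElement(shell, "name-node").text = "${nameNode}"
    ET.SubElement(shell, "exec").text = "hbase"
    for argument in ("shell", "-n", script):
        ET.SubElement(shell, "argument").text = argument
    # Assumption: the script sits next to workflow.xml in HDFS; <file>
    # makes it available under the same name in the working directory.
    ET.SubElement(shell, "file").text = f"{script}#{script}"
    ET.SubElement(action, "ok", {"to": ok_to})
    ET.SubElement(action, "error", {"to": error_to})
    return action


if __name__ == "__main__":
    action = hbase_shell_action(
        "HBASE-Shell", "hbase_creds", "create_my_table.rb",
        "do_more_things", "fail",
    )
    ET.indent(action)  # Python 3.9+
    print(ET.tostring(action, encoding="unicode"))
```

Generating the action from a function keeps the element order fixed (`exec`, then `argument` entries, then `file`), which is the order the shell action schema expects.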
HBase – Compactions
HBASE-19528: Major Compaction Tool
• Automatically scales compaction to selected number of servers
• Requires read ability to /hbase
usage: MajorCompactor [-cf <arg>] [-dryRun] -servers <arg> -table <arg>
       [...]
Usage instructions
 -cf <arg>           column families: comma separated eg: a,b,c
 -dryRun             Dry run, will just output a list of regions that
                     require compaction based on parameters passed
 -minModTime <arg>   Compact if store files have
                     modification time < minModTime
 -servers <arg>      Concurrent servers compacting
 -table <arg>        table name
...
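When the compactor is driven from a workflow, the flags typically end up as a list of `<argument>` values. The following Python sketch assembles that list from the options in the usage text above. The function name `major_compactor_args` is hypothetical, and the fully qualified class name in the usage note is an assumption about where the HBASE-19528 tool lives; verify it against your HBase version.

```python
def major_compactor_args(table, servers, column_families=None,
                         min_mod_time=None, dry_run=False):
    """Assemble MajorCompactor arguments using only flags from the usage text."""
    if servers < 1:
        raise ValueError("servers must be at least 1")
    args = ["-table", table, "-servers", str(servers)]
    if column_families:
        # -cf takes a comma-separated list, e.g. a,b,c
        args += ["-cf", ",".join(column_families)]
    if min_mod_time is not None:
        args += ["-minModTime", str(min_mod_time)]
    if dry_run:
        args.append("-dryRun")
    return args


if __name__ == "__main__":
    # Assumed class name; not stated on the slide.
    command = ["hbase", "org.apache.hadoop.hbase.util.compaction.MajorCompactor"]
    command += major_compactor_args("my_table", servers=2, dry_run=True)
    print(" ".join(command))
```

Starting with `dry_run=True` lists the regions that would be compacted without touching them, which is a low-risk way to try the tool before scheduling it.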
More Resources
• Apache Oozie Mailing Lists: http://oozie.apache.org/mail-lists.html
• Artem’s Oozie Resources:
— 12-Part Series on WFM: http://bit.ly/2syKUIh
— Oozie Examples: https://github.com/dbist/oozie-examples
• Clay’s Past Oozie Presentations:
— Code Deployment via Oozie: Apache BigData http://bit.ly/2sP2qbj
— HBase Multi-Tenancy with Oozie: DataWorks Summit http://bit.ly/2rw7FIR
Demo!
Questions?
