Andreas Neumann
Oozie – Workflow for Hadoop
Who Am I?
Dr. Andreas Neumann, Software Architect, Yahoo! (anew <at> yahoo-inc <dot> com)
At Yahoo! (2008-present): Grid architecture, Content Platform, Research
At IBM (2000-2008): Database (DB2) Development, Enterprise Search
Oozie Overview
Main Features:
- Execute and monitor workflows in Hadoop
- Periodic scheduling of workflows
- Trigger execution by data availability
- HTTP and command line interface + Web console
Adoption:
- ~100 users on the mailing list since launch on GitHub
- In production at Yahoo!, running >200K jobs/day
Oozie Workflow Overview
Purpose: Execution of workflows on the Grid
[Architecture diagram: Oozie runs as a Tomcat web app with a WS API, backed by a DB, submitting work to Hadoop/Pig/HDFS]
Oozie Workflow
Directed Acyclic Graph of Jobs
[Diagram: a workflow DAG with start, fork/join, a decision node (MORE/ENOUGH), and Java Main, M/R streaming, Pig, M/R, and FS actions connected by OK transitions to end]
Oozie Workflow Example

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>

[Diagram: Start → M-R wordcount → (OK) End; (Error) Kill]
Oozie Workflow Nodes
Control Flow: start/end/kill, decision, fork/join
Actions: map-reduce, pig, hdfs, sub-workflow, java (run custom Java code)
Oozie Workflow Application
An HDFS directory containing:
- Definition file: workflow.xml
- Configuration file: config-default.xml
- App files: lib/ directory with JAR and SO files, Pig scripts
Running an Oozie Workflow Job

Application Deployment:
$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount

Workflow Job Parameters:
$ cat job.properties
oozie.wf.application.path = hdfs://bar.com:9000/usr/abc/wordcount
input = /usr/abc/input-data
output = /user/abc/output-data

Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-W
Monitoring an Oozie Workflow Job

Workflow Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-W
------------------------------------------------------------------------
Workflow Name : wordcount-wf
App Path      : hdfs://bar.com:9000/usr/abc/wordcount
Status        : RUNNING
…

Workflow Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-W

Workflow Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-W
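Besides the command line, jobs can be driven through the HTTP interface mentioned earlier. A minimal sketch of building the submission payload: the Oozie WS API accepts a Hadoop-style <configuration> XML of job properties (the same properties as job.properties). The host/port and the exact endpoint shown in the comment are assumptions for illustration, not taken from the slides.

```python
# Sketch: serialize job properties into the <configuration> XML payload
# that the Oozie WS API expects. Endpoint and host below are assumptions.
import xml.etree.ElementTree as ET

def to_configuration_xml(props):
    """Turn a dict of job properties into Hadoop <configuration> XML."""
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")

payload = to_configuration_xml({
    "oozie.wf.application.path": "hdfs://bar.com:9000/usr/abc/wordcount",
    "user.name": "abc",
})
# The actual submission (not performed here) would be roughly:
#   POST http://oozie-host:11000/oozie/v1/jobs?action=start
#   Content-Type: application/xml, body = payload
print(payload)
```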
Oozie Coordinator Overview
Purpose: Coordinated execution of workflows on the Grid
Workflows are backwards compatible
[Architecture diagram: an Oozie Client talks to the WS API on Tomcat; the Oozie Coordinator checks data availability and drives the Oozie Workflow engine, which runs on Hadoop]
Oozie Application Lifecycle
[Diagram: between start and end, the Oozie Coordinator Engine materializes coordinator actions 0*f, 1*f, 2*f, …, N*f from a coordinator job; on "action create" and "action start", each action runs a workflow in the Oozie Workflow Engine]
Use Case 1: Time Triggers
Execute your workflow every 15 minutes (CRON)
[Timeline: 00:15, 00:30, 00:45, 01:00]
Example 1: Run Workflow every 15 mins

<coordinator-app name="coord1" start="2009-01-08T00:00Z" end="2010-01-01T00:00Z"
                 frequency="15" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>key1</name><value>value1</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
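The schedule above can be sketched in a few lines: one action is materialized every `frequency` minutes from `start` until `end`. This is illustrative datetime arithmetic, not Oozie code; names and the `limit` parameter are ours.

```python
# Sketch of coordinator materialization for Example 1: actions at
# start, start + 15 min, start + 30 min, ... until end (here capped at 4).
from datetime import datetime, timedelta

def materialization_times(start, end, frequency_minutes, limit=4):
    times, t = [], start
    while t < end and len(times) < limit:
        times.append(t)
        t += timedelta(minutes=frequency_minutes)
    return times

start = datetime(2009, 1, 8, 0, 0)   # start="2009-01-08T00:00Z"
end = datetime(2010, 1, 1, 0, 0)     # end="2010-01-01T00:00Z"
for t in materialization_times(start, end, 15):
    print(t.strftime("%Y-%m-%dT%H:%MZ"))
# First actions are materialized at 00:00, 00:15, 00:30, 00:45 on 2009-01-08
```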
Use Case 2: Time and Data Triggers
Materialize your workflow every hour, but only run it when the input data is ready.
[Timeline: at 01:00, 02:00, 03:00, 04:00, check "Input Data Exists?" before running on Hadoop]
Example 2: Data Triggers

<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>
        hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}
      </uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
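A sketch of how ${current(n)} and the uri-template resolve, under the assumed semantics that ${current(n)} names the dataset instance n steps away from the instance aligned with the action's nominal time, counting in dataset frequencies from initial-instance. This is an illustration of the idea, not Oozie's implementation.

```python
# Sketch of dataset instance resolution for Example 2 (assumed semantics).
from datetime import datetime, timedelta

INITIAL = datetime(2009, 1, 1, 0, 0)        # initial-instance
FREQ = timedelta(hours=1)                   # dataset frequency: ${1*HOURS}
TEMPLATE = "hdfs://bar:9000/app/logs/{0:%Y}/{0:%m}/{0:%d}/{0:%H}"

def current(n, nominal_time):
    """Instance n steps from the one aligned with the nominal time."""
    steps = int((nominal_time - INITIAL) / FREQ) + n
    return INITIAL + steps * FREQ

def resolve_uri(instance):
    """Substitute ${YEAR}/${MONTH}/${DAY}/${HOUR} in the uri-template."""
    return TEMPLATE.format(instance)

nominal = datetime(2009, 1, 2, 5, 0)        # action materialized at this time
print(resolve_uri(current(0, nominal)))
# -> hdfs://bar:9000/app/logs/2009/01/02/05
```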
Use Case 3: Rolling Windows
Access 15 minute datasets and roll them up into hourly datasets
[Timeline: the 00:15, 00:30, 00:45, 01:00 instances roll up at 01:00; the 01:15, 01:30, 01:45, 02:00 instances roll up at 02:00]
Example 3: Rolling Windows

<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="15" initial-instance="2009-01-01T00:00Z">
      <uri-template>
        hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}
      </uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-3)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
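The rolling window above can be sketched as follows: the range ${current(-3)} .. ${current(0)} selects the four 15-minute instances making up the hour that ends at the action's nominal time. Assumed semantics, for illustration only.

```python
# Sketch of the rolling window in Example 3: four 15-minute instances
# per hourly action (assumed ${current(n)} semantics).
from datetime import datetime, timedelta

INITIAL = datetime(2009, 1, 1, 0, 0)            # initial-instance
FREQ = timedelta(minutes=15)                    # dataset frequency: 15
TEMPLATE = "hdfs://bar:9000/app/logs/{0:%Y}/{0:%m}/{0:%d}/{0:%H}/{0:%M}"

def window(start_offset, end_offset, nominal_time):
    """Instances from current(start_offset) to current(end_offset)."""
    base = int((nominal_time - INITIAL) / FREQ)
    return [INITIAL + (base + n) * FREQ
            for n in range(start_offset, end_offset + 1)]

nominal = datetime(2009, 1, 1, 2, 0)            # hourly action at 02:00
for t in window(-3, 0, nominal):
    print(TEMPLATE.format(t))
# Selects the 01:15, 01:30, 01:45, and 02:00 instances
```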
Use Case 4: Sliding Windows
Access the last 24 hours of data, and roll them up every hour.
[Timeline: the window 01:00…24:00 is processed at 24:00; 02:00…+1 day 01:00 at +1 day 01:00; 03:00…+1 day 02:00 at +1 day 02:00; …]
Example 4: Sliding Windows

<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>
        hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}
      </uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-23)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
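What makes this window "sliding" rather than "rolling" is the overlap: each hourly action reads the 24 hourly instances ${current(-23)} .. ${current(0)}, so consecutive actions share 23 of them. A small sketch under the same assumed ${current(n)} semantics as above:

```python
# Sketch of the sliding window in Example 4: 24 hourly instances per
# action, with adjacent actions' windows overlapping in 23 instances.
from datetime import datetime, timedelta

INITIAL = datetime(2009, 1, 1, 0, 0)        # initial-instance
FREQ = timedelta(hours=1)                   # dataset frequency: ${1*HOURS}

def window(nominal_time, start_offset=-23, end_offset=0):
    """Instances from current(start_offset) to current(end_offset)."""
    base = int((nominal_time - INITIAL) / FREQ)
    return [INITIAL + (base + n) * FREQ
            for n in range(start_offset, end_offset + 1)]

a = window(datetime(2009, 1, 2, 1, 0))      # action at 01:00 (+1 day)
b = window(datetime(2009, 1, 2, 2, 0))      # next action, one hour later
print(len(a), len(set(a) & set(b)))
# Each window has 24 instances; adjacent windows share 23 of them
```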
Oozie Coordinator Application
An HDFS directory containing:
- Definition file: coordinator.xml
- Configuration file: coord-config-default.xml
Running an Oozie Coordinator Job

Application Deployment:
$ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job

Coordinator Job Parameters:
$ cat job.properties
oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job

Job Execution:
$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-C
Monitoring an Oozie Coordinator Job

Coordinator Job Status:
$ oozie job -info 1-20090525161321-oozie-xyz-C
------------------------------------------------------------------------
Job Name : wordcount-coord
App Path : hdfs://bar.com:9000/usr/abc/coord_job
Status   : RUNNING
…

Coordinator Job Log:
$ oozie job -log 1-20090525161321-oozie-xyz-C

Coordinator Job Definition:
$ oozie job -definition 1-20090525161321-oozie-xyz-C
Oozie Web Console: List Jobs
Oozie Web Console: Job Details
Oozie Web Console: Failed Action
Oozie Web Console: Error Messages
What's Next For Oozie?
New Features:
- More out-of-the-box actions: distcp, hive, …
- Authentication framework: authenticate a client with Oozie; authenticate an Oozie workflow with downstream services
- Bundles: manage multiple coordinators together
- Asynchronous data sets and coordinators
Scalability:
- Memory footprint
- Data notification instead of polling
Integration with Howl (http://github.com/yahoo/howl)
We Need You! Oozie is Open Source
Source: http://github.com/yahoo/oozie
Docs: http://yahoo.github.com/oozie
List: http://tech.groups.yahoo.com/group/Oozie-users/
To Contribute: https://github.com/yahoo/oozie/wiki/How-To-Contribute
Thank You! github.com/yahoo/oozie/wiki/How-To-Contribute

Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
