Apache Falcon
Data Management Platform for Hadoop
●Ajay Yadava
● Committer, Apache Falcon
● Lead - Apache Falcon @ Inmobi
What is Apache Falcon?
Falcon is a data processing and management solution for Hadoop
designed for data motion, coordination of data pipelines, lifecycle
management, and data discovery. Falcon enables end consumers to
quickly onboard their data and its associated processing and
management tasks on Hadoop clusters.
Core Services
● Process Relays
● Late Data Management
● Retention
● Replication
● Acquisition
● Operability
● Holistic declaration of intent
● Anonymization
● Lineage
● SLA
Life of Byte
Data Relays
Data Retention as a service
Late Data Management
Data Replication as a service
Data Acquisition As Service
Holistic Declaration of Intent
Operability – Dashboard
Overview
Entity Dependency Graph
● Feed depends on Cluster
● Process depends on Cluster
● Process depends on Feed
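The dependency edges above imply an order in which entities can be submitted: clusters first, then feeds, then processes. A minimal Java sketch of that ordering follows, using a plain depth-first topological sort. The class `EntityOrder` and the entity names are hypothetical and only illustrate the idea; this is not Falcon's own code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Falcon: orders entities so that every
// entity appears after the entities it depends on.
final class EntityOrder {

    // deps maps an entity name to the names of the entities it depends on.
    static List<String> submissionOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new LinkedHashSet<>();
        for (String entity : deps.keySet()) {
            visit(entity, deps, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String entity, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress,
                              List<String> order) {
        if (done.contains(entity)) {
            return;
        }
        if (!inProgress.add(entity)) {
            throw new IllegalStateException("Dependency cycle at " + entity);
        }
        // Emit all dependencies before the entity itself.
        for (String dep : deps.getOrDefault(entity, List.of())) {
            visit(dep, deps, done, inProgress, order);
        }
        inProgress.remove(entity);
        done.add(entity);
        order.add(entity);
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("process:clicks-hourly", List.of("feed:clicks-enhanced", "cluster:corp"));
        deps.put("feed:clicks-enhanced", List.of("cluster:corp"));
        deps.put("cluster:corp", List.of());
        System.out.println(submissionOrder(deps));
    }
}
```

Deleting entities would follow the reverse of this order, since an entity that others still depend on cannot be removed first.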
Cluster specification
<cluster colo="SF-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://nn:50070" version="1.1.2"/>
        <interface type="write" endpoint="hdfs://nn:8020" version="1.1.2"/>
        <interface type="execute" endpoint="rm:8050" version="1.1.2"/>
        <interface type="workflow" endpoint="http://oozie:41000/oozie/" version="4.0.0"/>
        <interface type="registry" endpoint="http://oozie:41000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://:61616?daemon=true" version="5.4.3"/>
    </interfaces>
    <locations>
        <location name="staging" path="/projects/falcon/staging"/> <!--mandatory-->
        <location name="temp" path="/projects/falcon/tmp"/> <!--optional-->
        <location name="working" path="/projects/falcon/working"/> <!--optional-->
    </locations>
</cluster>
● readonly: used by distcp for replication
● write: writing to HDFS
● execute: used to submit processes as MR
● workflow: submit Oozie jobs
● registry: Hive metastore to register/deregister partitions and get data availability events
● messaging: used for alerts
● locations: HDFS directories used by Falcon
Feed specification
<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(5)</frequency>
    <late-arrival cut-off="hours(1)"/>
    <sla slaLow="hours(2)" slaHigh="hours(3)"/>
    <clusters>
        <cluster name="primary" type="source">
            <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
        </cluster>
        <cluster name="secondary" type="target">
            <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
            <locations>
                <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
    </locations>
    <ACL owner="testuser-ut-user" group="group" permission="0x644"/>
</feed>
● Frequency: the frequency element
● Location: the locations element
● SLA Monitoring: the sla element
● Data Retention: the retention element of each cluster
● Data Replication: the source and target cluster entries
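To make the path template and the retention limit in the feed specification concrete, here is a small Java sketch. It assumes that the `${YEAR}` to `${MINUTE}` variables are filled from the instance time in UTC, and that an instance becomes eligible for the retention action once it is older than the limit. The class `FeedInstancePaths` and its methods are hypothetical, not Falcon code.

```java
import java.time.Duration;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper, not Falcon code: shows how a feed location template
// and a retention limit can be interpreted for a single feed instance.
final class FeedInstancePaths {

    private static String twoDigits(int value) {
        return String.format("%02d", value);
    }

    // Substitutes the date variables of a location template, using UTC.
    static String resolve(String template, ZonedDateTime instanceTime) {
        ZonedDateTime utc = instanceTime.withZoneSameInstant(ZoneOffset.UTC);
        return template
                .replace("${YEAR}", String.format("%04d", utc.getYear()))
                .replace("${MONTH}", twoDigits(utc.getMonthValue()))
                .replace("${DAY}", twoDigits(utc.getDayOfMonth()))
                .replace("${HOUR}", twoDigits(utc.getHour()))
                .replace("${MINUTE}", twoDigits(utc.getMinute()));
    }

    // Assumption: an instance is past retention once it is older than the limit.
    static boolean pastRetention(ZonedDateTime instanceTime, ZonedDateTime now,
                                 Duration limit) {
        return instanceTime.isBefore(now.minus(limit));
    }

    public static void main(String[] args) {
        String template =
                "/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}";
        ZonedDateTime start = ZonedDateTime.of(2013, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
        ZonedDateTime now = start.plusDays(3);
        // Three consecutive instances at the feed frequency of minutes(5).
        for (int i = 0; i < 3; i++) {
            ZonedDateTime instance = start.plusMinutes(5L * i);
            System.out.println(resolve(template, instance)
                    + " pastRetention="
                    + pastRetention(instance, now, Duration.ofDays(2)));
        }
    }
}
```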
Process specification
<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="corp">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>LIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
    </inputs>
    <outputs>
        <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
    </outputs>
    <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
    <retry policy="periodic" delay="hours(10)" attempts="3"/>
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
    </late-process>
</process>
● Where should the process run? The clusters element
● How should the process run? The parallel, order and frequency elements
● What to consume? The inputs element
● What to produce? The outputs element
● Late Data Handling: the late-process element
● Retry: the retry element
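The process specification names two policies, periodic (on retry) and exp-backoff (on late-process). The Java sketch below shows one way to read them: periodic waits the same delay before every attempt, while exp-backoff is assumed here to double the delay after each attempt. That doubling rule, the class `RetryDelays` and the enum names are assumptions for illustration; Falcon's actual computation may differ.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Falcon code: lists the wait before each attempt.
final class RetryDelays {

    enum Policy { PERIODIC, EXP_BACKOFF }

    static List<Duration> delays(Policy policy, Duration baseDelay, int attempts) {
        List<Duration> result = new ArrayList<>();
        Duration next = baseDelay;
        for (int i = 0; i < attempts; i++) {
            result.add(next);
            if (policy == Policy.EXP_BACKOFF) {
                // Assumption: the delay doubles after every attempt.
                next = next.multipliedBy(2);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values taken from the retry element: delay="hours(10)" attempts="3".
        System.out.println(delays(Policy.PERIODIC, Duration.ofHours(10), 3));
        // Base delay taken from the late-process element: delay="hours(1)".
        System.out.println(delays(Policy.EXP_BACKOFF, Duration.ofHours(1), 3));
    }
}
```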
Architecture
Falcon Unit
A pipeline validation framework
Motivation for Falcon Unit
● User errors caught only at deploy time.
● Input/Output feeds and paths not getting resolved.
● Errors in specification.
● Integration Tests require environment setup/tearDown.
● Messy deployment scripts.
● Debugging was cumbersome.
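Two of the problems listed above, errors in the specification and feeds that do not resolve, can be shown with a small standalone check. The Java sketch below parses a process XML with the JDK's DOM parser and reports feed names referenced by input or output elements that are not in a known set. The class `ProcessSpecCheck` is hypothetical and is not how Falcon Unit works internally; it only illustrates the kind of error that is cheaper to catch before deployment.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical pre-deployment check, not part of Falcon or Falcon Unit.
final class ProcessSpecCheck {

    // Returns feed names used by <input>/<output> that are not in knownFeeds.
    // Throws IllegalArgumentException when the XML is not well-formed.
    static List<String> unresolvedFeeds(String processXml, Set<String> knownFeeds) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(processXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Invalid process XML", e);
        }
        List<String> missing = new ArrayList<>();
        for (String tag : new String[] {"input", "output"}) {
            // "*" matches the element in any namespace.
            NodeList nodes = doc.getElementsByTagNameNS("*", tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                String feed = ((Element) nodes.item(i)).getAttribute("feed");
                if (!knownFeeds.contains(feed)) {
                    missing.add(feed);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String xml = "<process name='clicks-hourly' xmlns='uri:falcon:process:0.1'>"
                + "<inputs><input name='click' feed='clicks-enhanced'/></inputs>"
                + "<outputs><output name='clicksummary' feed='click-hourly'/></outputs>"
                + "</process>";
        // Only one of the two referenced feeds is known here.
        System.out.println(unresolvedFeeds(xml, Set.of("clicks-enhanced")));
    }
}
```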
Falcon Unit
A test suite runs through Falcon Unit against one of two environments.
In-process execution environment
● Local Oozie
● Local File System
● Local Job Runner
● Local Message Queue
Actual cluster
● Oozie
● HDFS
● YARN
● ActiveMQ
Example
Process Submission:
submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Local
submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Cluster Mode
Process Scheduling:
scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);
Process Verification:
getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
Deployment
Embedded Mode
Distributed Mode
Monitoring
SLA Monitoring
●Alerts based on data availability
●Dashboard
●Pluggable Alerting System
● Email
● JMS Notifications
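As an illustration of availability-based alerting, the Java sketch below classifies a feed instance that has not yet arrived, using the slaLow and slaHigh values from the feed specification. It assumes both limits are measured from the instance time, that exceeding slaLow means the instance is at risk, and that exceeding slaHigh means the SLA was missed. The class `FeedSlaCheck` and the status names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not Falcon code: classifies a feed instance that is
// still unavailable at time "now".
final class FeedSlaCheck {

    enum Status { WITHIN_SLA, AT_RISK, MISSED }

    static Status classify(Instant instanceTime, Instant now,
                           Duration slaLow, Duration slaHigh) {
        Duration waited = Duration.between(instanceTime, now);
        if (waited.compareTo(slaHigh) > 0) {
            return Status.MISSED;
        }
        if (waited.compareTo(slaLow) > 0) {
            return Status.AT_RISK;
        }
        return Status.WITHIN_SLA;
    }

    public static void main(String[] args) {
        Instant instance = Instant.parse("2013-01-01T00:00:00Z");
        Duration slaLow = Duration.ofHours(2);   // slaLow="hours(2)"
        Duration slaHigh = Duration.ofHours(3);  // slaHigh="hours(3)"
        for (long minutes : new long[] {60, 150, 240}) {
            Instant now = instance.plus(Duration.ofMinutes(minutes));
            System.out.println(minutes + " min: "
                    + classify(instance, now, slaLow, slaHigh));
        }
    }
}
```

An alerting plugin such as email or JMS could then be triggered on the AT_RISK and MISSED states.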
Pipeline view
Triage
●Better Authentication and Authorization
●Even Better UI
●Even Better monitoring
●Process SLAs
●Streaming support
●A more powerful scheduler
●Pipeline Recovery
Community
Questions?
●Apache Falcon
● falcon.apache.org
● dev@falcon.apache.org / user@falcon.apache.org
●Ajay Yadava
● ajayyadava@apache.org
