Hadoop first ETL on Apache Falcon

Srikanth Sundarrajan
Naresh Agarwal

About Authors
 Srikanth Sundarrajan
 Principal Architect, InMobi Technology Services
 Naresh Agarwal
 Director – Engineering, InMobi Technology Services

Agenda
 ETL & Challenges with Big Data
 Apache Falcon – Background
 Pipeline Designer – Overview
 Pipeline Designer – Internals

ETL (Extract, Transform, Load)
[Diagram labels: Data, Information, Intelligence, Value]

ETL Use Cases
 Data Warehouse
 Data Migration
 Data Consolidation
 Master Data Management
 Data Synchronization
 Data Archiving

ETL Authoring
 Hand coded
 In-house tools
 Off-the-shelf tools

ETL & Big Data – Challenges
 Volume
 Variety
 Velocity

Big Data ETL
 Mostly hand coded (high cost: implementation + maintenance)
 MapReduce
 Hive (i.e. SQL)
 Pig
 Crunch / Cascading
 Spark
 Off-the-shelf tools (scale / performance concerns)
 Mostly retrofitted

Apache Falcon
 Off the shelf, Falcon provides standard data management functions through declarative constructs
 Data movement recipes
 Cross data center replication
 Cross cluster data synchronization
 Data retention recipes
 Eviction
 Archival
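As an illustration of what a retention (eviction) recipe decides, the following is a minimal Python sketch, not Falcon code. The function name, the per-day partition model, and the retention window are assumptions made for the example.

```python
from datetime import date, timedelta


def partitions_to_evict(partition_dates, today, retention_days):
    """Return partition dates that fall outside the retention window, oldest first.

    Hypothetical helper: a partition is evicted when it is strictly older
    than `today - retention_days`.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


# Ten daily partitions, 2014-01-01 through 2014-01-10 (made-up data).
partitions = [date(2014, 1, 1) + timedelta(days=i) for i in range(10)]
evicted = partitions_to_evict(partitions, today=date(2014, 1, 10), retention_days=3)
print([d.isoformat() for d in evicted])
```

With a declarative retention limit, the user states only the window; a scan like this is what the platform would run on the user's behalf.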

Apache Falcon
 However, ETL-related functions are still largely left to the developer to implement. Today Falcon manages only:
 Orchestration
 Late data handling / Change data capture
 Retries
 Monitoring
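To make the retry function concrete, here is a small Python sketch of how a retry schedule could be derived from a declared policy. The policy names, the doubling rule for backoff, and the function itself are assumptions for illustration and are not taken from the slides.

```python
def retry_delays(policy, base_delay_minutes, attempts):
    """Return the wait (in minutes) before each retry attempt.

    Hypothetical policies:
      "periodic"    - the same delay before every attempt
      "exp-backoff" - the delay doubles after every attempt (assumed rule)
    """
    if attempts < 0 or base_delay_minutes < 0:
        raise ValueError("attempts and base_delay_minutes must be non-negative")
    if policy == "periodic":
        return [base_delay_minutes] * attempts
    if policy == "exp-backoff":
        return [base_delay_minutes * (2 ** i) for i in range(attempts)]
    raise ValueError(f"unknown retry policy: {policy}")


print(retry_delays("periodic", 10, 3))
print(retry_delays("exp-backoff", 10, 3))
```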

Pipeline Designer – Basics
 Feed
 A data entity that Falcon manages and that is physically present in a cluster
 Data in a feed conforms to a schema, and its partitions are registered with HCatalog
 Data management functions such as eviction and archival are specified declaratively through Falcon feed definitions

 Process
 A workflow that defines the actions to be performed, along with their control flow
 Executes at a specified frequency on one or more clusters
 Pipelines
 A logical grouping of Falcon processes that are owned and operated together
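The idea of a process executing at a specified frequency can be sketched in Python as follows. The frequency strings follow a unit(count) form such as hours(2); the parser is a simplified, hypothetical one that handles only minutes, hours and days, and the half-open time range is an assumption of this example.

```python
import re
from datetime import datetime, timedelta


def parse_frequency(expr):
    """Parse a string such as 'hours(2)' into a timedelta (simplified sketch)."""
    match = re.fullmatch(r"(minutes|hours|days)\((\d+)\)", expr)
    if not match:
        raise ValueError(f"unsupported frequency: {expr}")
    unit, count = match.group(1), int(match.group(2))
    if count == 0:
        raise ValueError("frequency must be positive")
    return timedelta(**{unit: count})


def instance_times(start, end, frequency):
    """List the nominal run times in the half-open range [start, end)."""
    step = parse_frequency(frequency)
    times = []
    current = start
    while current < end:
        times.append(current)
        current += step
    return times


runs = instance_times(datetime(2014, 6, 1, 0), datetime(2014, 6, 1, 6), "hours(2)")
print([t.strftime("%H:%M") for t in runs])
```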

 Actions
 Actions in the designer are the building blocks of process workflows
 Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
 Actions can transition to other actions
 Default / Success Transition
 Failure Transition
 Conditional Transition
 A transformation action is a special action that is itself a collection of transforms
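The action model above (output variables plus success, failure and conditional transitions) can be sketched in Python. All class, field and function names here are hypothetical and do not reflect the designer's actual data model; the sketch also assumes the flow has no cycles.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Variables = Dict[str, object]


@dataclass
class Action:
    name: str
    # Reads variables emitted earlier in the flow and returns new ones.
    run: Callable[[Variables], Variables]
    on_success: Optional[str] = None   # default / success transition
    on_failure: Optional[str] = None   # failure transition
    # Conditional transitions: the first predicate that holds wins.
    conditional: List[Tuple[Callable[[Variables], bool], str]] = field(default_factory=list)


def execute(actions, start):
    """Walk the flow from `start`; return the visited action names and the variables."""
    by_name = {a.name: a for a in actions}
    variables: Variables = {}
    visited = []
    current = start
    while current is not None:
        action = by_name[current]
        visited.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            current = action.on_failure
            continue
        current = next(
            (target for predicate, target in action.conditional if predicate(variables)),
            action.on_success,
        )
    return visited, variables


flow = [
    Action("count", lambda v: {"rows": 0}, on_success="load",
           conditional=[(lambda v: v["rows"] == 0, "skip")]),
    Action("load", lambda v: {"loaded": True}),
    Action("skip", lambda v: {"skipped": True}),
]
visited, variables = execute(flow, "count")
print(visited, variables)
```

Here the "count" action emits rows = 0, so the conditional transition routes the flow to "skip" rather than to the default "load".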

 Transforms
 A data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
 Multiple transforms can be stitched together to compose a single transformation action, which can in turn be used to build a flow
 Composite Transformations
 Transforms built by combining multiple primitive transforms
 The system can be extended by adding more transforms
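Stitching primitive transforms into a composite one can be sketched in Python as function composition. This is a simplified illustration that assumes single-input, single-output transforms over lists of dictionaries; the names filter_rows, project and compose are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Transform = Callable[[List[Row]], List[Row]]


def filter_rows(predicate: Callable[[Row], bool]) -> Transform:
    """Primitive transform: keep only the rows matching the predicate."""
    return lambda rows: [r for r in rows if predicate(r)]


def project(columns: List[str]) -> Transform:
    """Primitive transform: keep only the named columns."""
    return lambda rows: [{c: r[c] for c in columns} for r in rows]


def compose(*transforms: Transform) -> Transform:
    """Composite transform: apply the given transforms in order."""
    def composite(rows: List[Row]) -> List[Row]:
        for transform in transforms:
            rows = transform(rows)
        return rows
    return composite


# Made-up sample data.
clicks = [
    {"country": "IN", "clicks": 12},
    {"country": "US", "clicks": 0},
    {"country": "JP", "clicks": 3},
]
active_countries = compose(filter_rows(lambda r: r["clicks"] > 0), project(["country"]))
print(active_countries(clicks))
```

Because a composite has the same shape as a primitive, it can be reused inside further composites, which is one way new transforms could be added without changing the rest of the system.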

 Deployment & Monitoring
 Once a process and its pipeline are composed, they are deployed in Falcon as a standard process

Pipeline Designer Service
[Architecture diagram. Components shown: Designer UI; Falcon Dashboard; Pipeline Designer Service (REST API, versioned storage, flow / action / transforms, compiler + optimizer); Falcon Server (process, feed); HCatalog Service (schema)]

Pipeline Designer – Internals
 Transformation actions are compiled into Pig scripts
 Actions and flows are compiled into Falcon process definitions
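To give a feel for compiling a transformation action into a Pig script, here is a minimal Python sketch. It is not the designer's compiler: the step representation, the relation names (r0, r1, ...) and the supported step kinds are assumptions, and only LOAD, FILTER, FOREACH ... GENERATE and STORE statements are emitted.

```python
def compile_to_pig(input_path, schema, steps, output_path):
    """Translate a linear list of transform steps into Pig Latin text.

    schema: list of (field_name, pig_type) pairs
    steps:  list of (kind, argument) pairs, where kind is
            "filter"  (argument: a Pig boolean expression) or
            "project" (argument: a list of field names)
    """
    fields = ", ".join(f"{name}:{pig_type}" for name, pig_type in schema)
    lines = [f"r0 = LOAD '{input_path}' AS ({fields});"]
    for i, (kind, argument) in enumerate(steps, start=1):
        previous, current = f"r{i - 1}", f"r{i}"
        if kind == "filter":
            lines.append(f"{current} = FILTER {previous} BY {argument};")
        elif kind == "project":
            lines.append(f"{current} = FOREACH {previous} GENERATE {', '.join(argument)};")
        else:
            raise ValueError(f"unsupported transform: {kind}")
    lines.append(f"STORE r{len(steps)} INTO '{output_path}';")
    return "\n".join(lines)


script = compile_to_pig(
    "/data/clicks",                                  # hypothetical input path
    [("country", "chararray"), ("clicks", "int")],
    [("filter", "clicks > 0"), ("project", ["country"])],
    "/data/active_countries",                        # hypothetical output path
)
print(script)
```

A real compiler would also need to handle multi-input transforms such as joins, and an optimizer step could merge or reorder statements before emitting the script.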

Thanks
sriksun@apache.org
naresh.agarwal@inmobi.com
