Hadoop First ETL on Apache Falcon
Srikanth Sundarrajan
Naresh Agarwal
About Authors
• Srikanth Sundarrajan
  - Principal Architect, InMobi Technology Services
• Naresh Agarwal
  - Director – Engineering, InMobi Technology Services
Agenda
• ETL & Challenges with Big Data
• Apache Falcon – Background
• Pipeline Designer – Overview
• Pipeline Designer – Internals
ETL (Extract, Transform, Load)
[Figure: Data, Information, Intelligence, with Value increasing as data is refined]
ETL Use Cases
• Data Warehouse
• Data Migration
• Data Consolidation
• Master Data Management
• Data Synchronization
• Data Archiving
ETL Authoring
• Hand-coded
• In-house tools
• Off-the-shelf tools
ETL & Big Data – Challenges
• Volume
• Variety
• Velocity
Big Data ETL
• Mostly hand-coded (high cost: implementation + maintenance)
  - MapReduce
  - Hive (i.e. SQL)
  - Pig
  - Crunch / Cascading
  - Spark
• Off-the-shelf tools (scale and performance concerns)
  - Mostly retrofitted
Apache Falcon
• Off the shelf, Falcon provides standard data management functions through declarative constructs
  - Data movement recipes
    - Cross data center replication
    - Cross cluster data synchronization
  - Data retention recipes
    - Eviction
    - Archival
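To make the idea of a declarative retention recipe concrete, here is a minimal sketch in Python. It is not Falcon's implementation or its entity format: `RetentionPolicy`, `instances_to_evict` and the `action` values are hypothetical names chosen for illustration. The point is only that the user states a limit, and the system works out which feed instances fall outside it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical declarative policy: keep instances newer than `limit`."""
    limit: timedelta
    action: str = "delete"  # illustrative only; "archive" would be the other recipe


def instances_to_evict(instance_times, policy, now):
    """Return the instance timestamps that fall outside the retention window."""
    cutoff = now - policy.limit
    return sorted(t for t in instance_times if t < cutoff)


# Nine daily instances (1 June to 9 June 2014), evaluated on 10 June
# with a three-day retention limit.
now = datetime(2014, 6, 10)
daily_instances = [datetime(2014, 6, day) for day in range(1, 10)]
policy = RetentionPolicy(limit=timedelta(days=3))
expired = instances_to_evict(daily_instances, policy, now)
```

Here `expired` holds the instances dated before 7 June; what happens to them (deletion or archival) would be decided by the policy's action rather than by hand-written job code.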
Apache Falcon
• However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only:
  - Orchestration
  - Late data handling / Change data capture
  - Retries
  - Monitoring
Pipeline Designer – Basics
• Feed
  - A data entity that Falcon manages and that is physically present in a cluster.
  - Data in a feed conforms to a schema, and its partitions are registered with HCatalog.
  - Data management functions such as eviction and archival are specified declaratively through Falcon feed definitions.
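A rough way to picture a feed is as a named dataset with a schema, a frequency and a location pattern from which individual instances (partitions) are derived. The sketch below is an assumption-laden illustration, not Falcon's data model: the `Feed` class and its fields are hypothetical, and only the `${YEAR}`-style placeholders are borrowed from the convention used in Falcon feed location paths.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Feed:
    """Hypothetical in-memory picture of a feed; not Falcon's entity schema."""
    name: str
    path_template: str      # location pattern with ${YEAR}-style placeholders
    schema: dict            # column name -> type name
    frequency: timedelta    # how often a new instance is produced

    def instance_path(self, when):
        """Resolve the location of the instance for a given nominal time."""
        return (self.path_template
                .replace("${YEAR}", f"{when:%Y}")
                .replace("${MONTH}", f"{when:%m}")
                .replace("${DAY}", f"{when:%d}")
                .replace("${HOUR}", f"{when:%H}"))

    def instances(self, start, end):
        """Yield (nominal time, path) for every instance in [start, end)."""
        t = start
        while t < end:
            yield t, self.instance_path(t)
            t += self.frequency


clicks = Feed(
    name="clicks",
    path_template="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}",
    schema={"user": "string", "amount": "int"},
    frequency=timedelta(hours=1),
)
paths = [p for _, p in clicks.instances(datetime(2014, 6, 3, 0),
                                        datetime(2014, 6, 3, 3))]
```

Each resolved path corresponds to one partition that, per the slide, would be registered with HCatalog alongside the feed's schema.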
Pipeline Designer – Basics
• Process
  - A workflow that defines the actions to be performed, along with their control flow
  - Executes at a specified frequency on one or more clusters
• Pipelines
  - A logical grouping of Falcon processes that are owned and operated together
Pipeline Designer – Basics
• Actions
  - Actions in the designer are the building blocks of process workflows.
  - Actions have access to output variables emitted earlier in the flow and can emit output variables of their own.
  - Actions can transition to other actions:
    - Default / success transition
    - Failure transition
    - Conditional transition
  - A transformation action is a special action that is itself a collection of transforms.
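The action model described above (shared output variables plus success, failure and conditional transitions) can be sketched as a small state machine. This is only an illustration under stated assumptions: `Action`, `run_flow` and every field name are hypothetical, the rule that the first matching condition wins is my choice, and the sketch does not guard against cyclic flows.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Context = Dict[str, object]  # output variables accumulated along the flow


@dataclass
class Action:
    name: str
    run: Callable[[Context], Context]   # reads earlier outputs, returns new ones
    on_success: Optional[str] = None    # default / success transition
    on_failure: Optional[str] = None    # failure transition
    # conditional transitions: (predicate over the context, target action name)
    conditions: List[Tuple[Callable[[Context], bool], str]] = field(default_factory=list)


def run_flow(actions: Dict[str, Action], start: str) -> Tuple[List[str], Context]:
    """Execute actions from `start`, following transitions until none remains."""
    context: Context = {}
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None:
        action = actions[current]
        visited.append(current)
        try:
            context.update(action.run(context))
        except Exception as exc:
            context["error"] = str(exc)
            current = action.on_failure
            continue
        # The first matching condition wins; otherwise take the default transition.
        current = next(
            (target for predicate, target in action.conditions if predicate(context)),
            action.on_success,
        )
    return visited, context


actions = {
    "count": Action("count", lambda ctx: {"rows": 0},
                    on_success="transform", on_failure="alert",
                    conditions=[(lambda ctx: ctx["rows"] == 0, "skip")]),
    "transform": Action("transform", lambda ctx: {"status": "transformed"}),
    "skip": Action("skip", lambda ctx: {"status": "skipped"}),
    "alert": Action("alert", lambda ctx: {"status": "alerted"}),
}
visited, context = run_flow(actions, "count")
```

In this run the `count` action emits `rows = 0`, the conditional transition fires, and the flow goes to `skip` instead of `transform`.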
Pipeline Designer – Basics
• Transforms
  - A transform is a data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs.
  - Multiple transforms can be stitched together to compose a single transformation action, which can in turn be used to build a flow.
• Composite transformations
  - Transforms built by combining multiple primitive transforms
  - More transforms can be added to extend the system.
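One way to read "stitching transforms together" is function composition with a schema check at each joint. The sketch below assumes that reading; `Transform` and `compose` are hypothetical names, and for brevity each transform takes a single input and produces a single output, whereas the slide allows several of each.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Schema = Dict[str, type]


@dataclass
class Transform:
    name: str
    input_schema: Schema
    output_schema: Schema
    apply: Callable[[List[Row]], List[Row]]


def compose(name: str, *steps: Transform) -> Transform:
    """Build a composite transform, checking that adjacent schemas line up."""
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.output_schema != downstream.input_schema:
            raise ValueError(f"{upstream.name} -> {downstream.name}: schema mismatch")

    def run(rows: List[Row]) -> List[Row]:
        for step in steps:
            rows = step.apply(rows)
        return rows

    return Transform(name, steps[0].input_schema, steps[-1].output_schema, run)


raw: Schema = {"user": str, "amount": int}

# Two primitive transforms: a filter and a projection.
keep_positive = Transform("keep_positive", raw, raw,
                          lambda rows: [r for r in rows if r["amount"] > 0])
project_user = Transform("project_user", raw, {"user": str},
                         lambda rows: [{"user": r["user"]} for r in rows])

# A composite transform built from the primitives.
paying_users = compose("paying_users", keep_positive, project_user)
result = paying_users.apply([{"user": "a", "amount": 5},
                             {"user": "b", "amount": 0}])
```

Because the composite is itself a `Transform`, it can be reused inside further compositions, which is the extensibility the slide alludes to.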
Pipeline Designer – Basics
• Deployment & Monitoring
  - Once a process and its pipeline are composed, they are deployed in Falcon as a standard process.
Pipeline Designer Service
[Architecture diagram. Labeled components: Designer UI, Falcon Dashboard, Pipeline Designer Service (REST API; Versioned Storage; Flow / Action / Transforms; Compiler + Optimizer), Falcon Server, HCatalog Service. Labeled connections: Process, Feed, Schema.]
Pipeline Designer – Internals
• Transformation actions are compiled into Pig scripts
• Actions and flows are compiled into Falcon process definitions
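As a rough illustration of what compiling a transformation action into a Pig script could look like, the sketch below turns a list of simple steps into Pig Latin text. The step classes (`Load`, `Filter`, `Project`, `Store`) and `compile_to_pig` are hypothetical and say nothing about the designer's actual compiler or optimizer. Reading and writing through HCatalog's `HCatLoader` and `HCatStorer` is also an assumption, suggested only by the slides' statement that feeds are registered with HCatalog.

```python
from dataclasses import dataclass
from typing import List, Union

# Assumed HCatalog Pig adapters; the designer may emit something different.
HCAT_LOADER = "org.apache.hive.hcatalog.pig.HCatLoader"
HCAT_STORER = "org.apache.hive.hcatalog.pig.HCatStorer"


@dataclass
class Load:
    alias: str
    table: str            # HCatalog table name, e.g. "db.table"


@dataclass
class Filter:
    alias: str
    source: str
    condition: str        # Pig boolean expression, passed through verbatim


@dataclass
class Project:
    alias: str
    source: str
    columns: List[str]


@dataclass
class Store:
    source: str
    table: str


Step = Union[Load, Filter, Project, Store]


def compile_to_pig(steps: List[Step]) -> str:
    """Emit one Pig Latin statement per step, in order."""
    lines = []
    for step in steps:
        if isinstance(step, Load):
            lines.append(f"{step.alias} = LOAD '{step.table}' USING {HCAT_LOADER}();")
        elif isinstance(step, Filter):
            lines.append(f"{step.alias} = FILTER {step.source} BY {step.condition};")
        elif isinstance(step, Project):
            columns = ", ".join(step.columns)
            lines.append(f"{step.alias} = FOREACH {step.source} GENERATE {columns};")
        elif isinstance(step, Store):
            lines.append(f"STORE {step.source} INTO '{step.table}' USING {HCAT_STORER}();")
        else:
            raise TypeError(f"unsupported step: {step!r}")
    return "\n".join(lines)


script = compile_to_pig([
    Load("clicks", "web.clicks"),
    Filter("valid", "clicks", "amount > 0"),
    Project("users", "valid", ["user", "amount"]),
    Store("users", "web.paying_users"),
])
```

A real compiler would also validate schemas and optimize the plan before emitting a script; this sketch does a one-to-one translation only.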
Mocks
Q & A
Thanks
mailto:sriksun@apache.org
mailto:naresh.agarwal@inmobi.com



Editor's Notes

  • #4: We first look at the general applications and use cases of ETL and at the specific challenges of ETL over big data. We then see how Apache Falcon attempts to address these with an upcoming feature: Pipeline Designer, a new feature being added to Falcon to support ETL authoring. We look into the specifics of this feature and the designer internals. Finally, we look at some mocks of the feature to get a sense of how it would take shape.
  • #6: As data is refined, curated and processed into meaningful information and insight/intelligence, higher-order value is derived from it. ETL plays a pivotal role in this derivation process. Decades ago, data resided in just one or very few systems, and data integration / ETL were not dominant problems. As systems were broken down into numerous subsystems, they have assumed much greater significance. With the explosion of data and the focus on it, the needs and complexity will only increase further.
  • #7: Data warehousing is probably the most common use case one comes across in the context of ETL, but there are other use cases besides data warehousing and business intelligence:
      Data Migration: migrating from one data model to another, or from one system to another
      Data Consolidation: mergers and acquisitions often leave an organization needing to consolidate data
      Data Archiving: moving data to low-cost storage, mostly to support compliance requirements
      Master Data Management: supporting a single source of truth for master data across all systems within an organization
      Data Synchronization: synchronizing across data centers for disaster recovery (DR) and business continuity planning (BCP)
  • #8: For most of its history, ETL has been authored through hand-coded scripts, through in-house tools catering specifically to the context of a business, or through general-purpose off-the-shelf tools, possibly with a wide variety of connectors and plugins.
  • #9: When it comes to large-scale or big data, the challenges are further compounded:
      Volume: scale and size
      Variety: diverse sources, dynamic schemas, unstructured data
      Velocity: freshness, cycle turnaround time