Apache Falcon (Incubating)
Data Management Platform on Hadoop
Srikanth Sundarrajan
Venkatesh Seetharam
whoami
Srikanth Sundarrajan
Principal Architect, InMobi
Apache Hadoop Contributor
Hadoop Team @ Yahoo!
Venkatesh Seetharam
Architect/Developer, Hortonworks
Apache Hadoop Contributor
Data Management @ Yahoo!
Agenda
1 Motivation
2 Falcon Overview
3 Case Studies
4 Questions & Answers
MOTIVATION
Data Processing Landscape
External data source → Acquire (Import) → Data Processing (Transform/Pipeline) → Export
Alongside: Replicate (Copy), Eviction, Archive
Core Services
Process
• Late data management
• Relays
Data management
• Acquisition
• Replication
• Retention
Operability
• SLA
• Lineage
Process Management – Relays
picture courtesy: http://istockphoto.com/
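In Falcon's DSL, relays fall out of feed-mediated dependencies: a downstream process simply names the upstream process's output feed as its input, and the generated Oozie coordinators hand off as each instance materializes, with no buffering between steps. A rough sketch — entity names, EL expressions and paths are illustrative, not from the deck:

```xml
<!-- Downstream process consumes the feed that an upstream process
     produces; Falcon schedules it as soon as the input instance lands. -->
<process name="hourlySummarizer" xmlns="uri:falcon:process:0.1">
  <frequency>hours(1)</frequency>
  <inputs>
    <!-- "enrichedClicks" is the output feed of the upstream enrichment process -->
    <input name="input" feed="enrichedClicks" start="now(-1,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="hourlySummary" instance="now(0,0)"/>
  </outputs>
  <workflow engine="oozie" path="/apps/summarizer/workflow.xml"/>
</process>
```

Because the relay is expressed as data availability rather than a direct process-to-process edge, any number of consumers can hang off the same feed without redefining the upstream step.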
Late Data Management
picture courtesy: http://iwebask.com
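In Falcon this is declarative: the feed states how late its data may arrive, and the consuming process states what to do when late data shows up for an already-processed instance. A sketch, with element names per the Falcon entity DSL as best recalled, and paths purely illustrative:

```xml
<!-- In the feed definition: data may trickle in up to 6 hours late. -->
<late-arrival cut-off="hours(6)"/>

<!-- In the consuming process definition: when late input for an
     already-processed instance is detected, re-run a dedicated late
     workflow, backing off exponentially between detection attempts. -->
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="input" workflow-path="/apps/enrichment/late-workflow.xml"/>
</late-process>
```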
Data Retention As Service
picture courtesy: http://vimeo.com/
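Declaratively, retention is a per-cluster property of a feed: Falcon generates and schedules the eviction workflow itself, under the feed owner's credentials rather than a world-writable superuser. A sketch of the relevant fragment (cluster name and dates are illustrative):

```xml
<!-- Fragment of a feed definition: instances on the primary cluster
     are evicted once they are older than 90 days. -->
<clusters>
  <cluster name="primary" type="source">
    <validity start="2013-01-01T00:00Z" end="2099-01-01T00:00Z"/>
    <retention limit="days(90)" action="delete"/>
  </cluster>
</clusters>
```

Different feeds (or the same feed on different clusters) can carry different limits, which is how the "different aging criteria per data type" concern is handled.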
Data Replication As Service
picture courtesy: http://boylesmedia.com
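In the DSL, replication is expressed by giving one feed several clusters: Falcon generates the copy workflow from each source to each target. A sketch — the feed name, colo names and paths are illustrative:

```xml
<feed name="rawClicks" xmlns="uri:falcon:feed:0.1">
  <frequency>minutes(5)</frequency>
  <clusters>
    <cluster name="primaryColo" type="source">
      <validity start="2013-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- Falcon generates and schedules the copy workflow to the target;
         the target can even keep a longer retention for DR/BCP. -->
    <cluster name="backupColo" type="target">
      <validity start="2013-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="months(6)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
  </locations>
</feed>
```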
Data Acquisition As Service
picture courtesy: http://wmpu.org
Operability – Dashboard
picture courtesy: http://www.opentrack.ch/
FALCON OVERVIEW
Holistic Declaration of Intent
picture courtesy: http://bigboxdetox.com
Entity Dependency Graph
Cluster: Hadoop / HBase …, external data source
Feed: depends on cluster
Process: depends on feeds
High Level Architecture
Apache Falcon layers over Oozie, JMS Messaging, HCatalog and Hadoop
Entities submitted via CLI/REST; entity status returned to the caller
Process status / notifications delivered over JMS
Entity definitions persisted in the config store
Feed Schedule
Cluster xml + Feed xml → Falcon (config store / graph)
Falcon generates the retention / replication workflow
Oozie scheduler executes it on HDFS
JMS notification per action
Catalog service
Instance management
Process Schedule
Cluster/feed xml + Process xml → Falcon (config store / graph)
Falcon generates the process workflow
Oozie scheduler executes it on HDFS
JMS notification per available feed
Catalog service
Instance management
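Putting the pieces together, a process definition that Falcon would compile into an Oozie-scheduled workflow might look roughly like this — entity names, EL expressions and paths are illustrative sketches, not taken from the deck:

```xml
<process name="clickEnrichment" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryColo">
      <validity start="2013-01-01T00:00Z" end="2099-01-01T00:00Z"/>
    </cluster>
  </clusters>
  <!-- one instance at a time, oldest first -->
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>minutes(5)</frequency>
  <inputs>
    <input name="input" feed="rawClicks" start="now(0,-5)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="enrichedClicks" instance="now(0,0)"/>
  </outputs>
  <workflow engine="oozie" path="/apps/enrichment/workflow.xml"/>
  <!-- retry handling is policy, not user code -->
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
</process>
```

Instance management (rerun, kill, suspend of a particular 5-minute instance) then operates on the materialized runs of this single definition.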
Physical Architecture
Falcon + Scheduler deployed per colo (Colo 1, Colo 2, Colo 3)
Falcon Prism provides the global view across colos
CASE STUDY
Multi Cluster Failover
Apache Falcon at Hadoop Summit 2013
CASE STUDY
Distributed Processing
Example: Digital Advertising @ InMobi
Hadoop @ InMobi
 About InMobi
 World's leading independent mobile advertising company
 Hadoop usage at InMobi
 ~ 6 Clusters
 > 1PB of storage
 > 5TB new data ingested each day
 > 20TB data crunched each day
 > 200 nodes in HDFS/MR clusters & > 40 nodes in HBase
 > 175K Hadoop jobs / day
 > 60K Oozie workflows / day
 300+ Falcon feed definitions
 100+ Falcon process definitions
Processing – Single Data Center
Feeds: Ad Request data, Impression render event, Click event, Conversion event
Continuous Streaming (minutely)
Enrichment (minutely / 5-minutely)
Summarizer
Hourly summary
Global Aggregation
In each data center (DataCenter1 …… DataCenterN):
Feeds: Ad Request data, Impression render event, Click event, Conversion event
Continuous Streaming (minutely)
Enrichment (minutely / 5-minutely)
Summarizer
Hourly summary
The per-DC hourly summaries roll up into a consumable global aggregate
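This multi-DC layout can be captured in a single feed definition: each data center is a source contributing its own partition, and the global data center is the lone target where the aggregate is assembled. A sketch with illustrative names, per the Falcon partition support described in the speaker notes:

```xml
<!-- One feed spanning data centers: each colo is a source partition;
     the global colo is the single target for aggregation. -->
<feed name="hourlySummary" xmlns="uri:falcon:feed:0.1">
  <partitions>
    <partition name="colo"/>
  </partitions>
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="dc1" type="source" partition="dc1">
      <validity start="2013-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
    <!-- …one source <cluster> entry per additional data center… -->
    <cluster name="globalColo" type="target">
      <validity start="2013-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="months(6)" action="delete"/>
    </cluster>
  </clusters>
</feed>
```

The enrichment and summarizer processes likewise keep a single definition listing multiple data centers, with identical periodicity and scheduling configuration everywhere.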
HIGHLIGHTS
Future
Security
Embed Pig/Hive scripts
Data Acquisition – file-based
Monitoring/Management
Dashboard
Summary
Questions?
 Apache Falcon
 http://falcon.incubator.apache.org
 mailto: dev@falcon.incubator.apache.org
 Srikanth Sundarrajan
 sriksun@apache.org
 #sriksun
 Venkatesh Seetharam
 venkatesh@apache.org
 #innerzeal

Editor's Notes
• #6: In a typical big data environment involving Hadoop, the use cases tend to be around processing very large volumes of data for either machine or human consumption. Some of the data that reaches the Hadoop platform can contain critical business and financial information. The data processing team in such an environment is often distracted by a multitude of data management and process orchestration challenges. To name a few:
- Ingesting large volumes of events/streams
- Ingesting slowly changing data typically available in a traditional database
- Creating a pipeline / sequence of processing logic to extract the desired insight / information
- Handling processing complexities relating to change of data / failures
- Managing eviction of older data elements
- Backing up data in an alternate location, or archiving it in cheaper storage, for DR/BCP and compliance requirements
- Shipping data out of the Hadoop environment periodically for machine or human consumption
These tend to be standard challenges that are better handled in a platform, which allows the data processing team to focus on their core business application. A platform approach also lets us adopt best practices in solving each of these for subsequent users of the platform to leverage. What do we mean by DM: the platform should provide these as services, so users worry only about business processing; it captures common themes, follows best practices, and frees users from these concerns.
• #7: As we just noted, there are numerous data and process management services which, when made available to the data processing team, can reduce their day-to-day complexities significantly and allow them to focus on their business application. This is an enumeration of such services, which we intend to cover in adequate detail as we go along.
• #8: More often than not, pipelines are sequences of data processing or data movement tasks that need to happen before raw data can be transformed into a meaningfully consumable form. Normally the end stage of the pipeline, where the final sets of data are produced, is on the critical path and may be subject to tight SLA bounds. Any step in the sequence that is either delayed or failed could cause the pipeline to stall. It is important that each step hands off to the next to avoid any buffering of time and to allow seamless progression of the pipeline. People familiar with Apache Oozie will recognize this feature as provided through the Coordinator. As pipelines get more time-critical and time-sensitive this becomes very important, and it ought to be available off the shelf to application developers. It is also important for this feature to be scalable enough to support concurrent pipelines.
• #9: From our experience there are typically two reasons why large volumes of data are processed:
- SLA-critical, machine-consumable data (with some tolerance for error)
- Factual reporting with a "Close of Books" notion, for human consumption (not always, but frequently enough)
The first class of application is not much affected if some small percentage of data arrives late; examples include forecasting, predictions and risk management. The second class, however, is used for factual reporting, the results of which may be subject to audit. For these use cases it is not acceptable to ignore data that arrived out of order or late. The platform in such cases needs to give the application author the ability to detect the arrival of late data and enable reprocessing, which may also require a cascading reprocess of all downstream apps. Having this service available off the shelf relieves the application developer of the pain of managing it themselves.
• #10: The fact that data volumes are large and increasing by the day is the reason one adopts a big data platform like Hadoop, and that automatically means we would run out of space pretty soon if we didn't take care of evicting and purging older instances of data. A few problems to consider for retention:
- Avoid using a general-purpose superuser with world-writable privileges to delete old data (for obvious reasons)
- Different types of data may require different criteria for aging and hence purging
- Other life cycle functions, such as archival of old data, if defined, ought to be scheduled before eviction kicks in
• #11: Hadoop is becoming increasingly critical for many businesses. For some users the raw data volumes are too large to be shipped to one place for processing; for others, data needs to be redundantly available for business continuity reasons. In either scenario, replication of data from one cluster to another plays a vital role, and having it available as a service again frees the application developer of these responsibilities. The key challenges to consider while offering this as a service:
- Bandwidth consumption and management
- Chunking/bulking strategy
- Correctness guarantees
- HDFS version compatibility issues
Two dimensions: BCP/DR, and local/global aggregation (shipping local aggregates as part of a pipeline).
• #13: An integrated view of what is happening in the system right now, based on holistic information about all its elements (data, associated management functions, processing logic and location), provides a compelling view of the "state of the system" at any time. This is a much-needed platform feature for the larger goal of allowing the data application developer to focus on the business or processing logic. Adding alerting and notifications completes the operability story: dashboard, alerts, notifications.
• #15: Some of the things we have spoken about so far can be done with a silo-ed approach. For instance, it is possible to process a few data sets and produce a few more through a scheduler. However, if there are two other consumers of the data produced by the first workflow, the same definition will be repeated by those two consumers, and so on. There is serious duplication of metadata about what data is ingested, processed or produced, where it is processed, and how it is produced. A single system that maintains a complete view of this can provide a far more complete picture of what is happening than a collection of independently scheduled applications. Otherwise, both the production support and application development teams on a Hadoop platform have to scramble to write custom scripts and monitoring systems to get a broader, holistic view. An approach where this information is systemically collected and used for seamless management can alleviate much of the pain of operating or developing data processing applications on Hadoop.
• #16: The entity graph at the core is what makes Falcon what it is, and it enables all the unique features Falcon offers or can potentially make available in the future. At the core:
- Dependencies between data, processing logic and cluster end points
- Rules governing data management
- Processing management
- Metadata management
• #17: The system accepts entities described in a DSL: infrastructure, datasets, and pipeline/processing logic. It transforms this input into automated, scheduled workflows; orchestrates those workflows; instruments execution of configured policies; handles retry logic and late data processing; records audit and lineage; integrates seamlessly with the metastore/catalog (WIP); and provides notifications based on availability. The result is an integrated, seamless experience for users: processing is automated and end-to-end progress is tracked; data set management (replication, retention, etc.) is offered as a service; users can cherry-pick, with no coupling between primitives; and hooks are provided for monitoring and metrics collection.
• #26: Ad Request, Click, Impression and Conversion feeds: minutely, with identical location and retention configuration, but across many data centers. Summary data: hourly, with multiple partitions (one per DC, each configured as a source, and one target which is the global data center). Click, Impression and Conversion enrichment processes and the Summarizer: a single definition spanning multiple data centers, with identical periodicity and scheduling configuration.