Page 1 © Hortonworks Inc. 2014
Discover HDP 2.1
Apache Falcon for Data Governance in Hadoop
Hortonworks. We do Hadoop.
Speakers
Justin Sears
Hortonworks Product Marketing Manager
Himanshu Bari
Hortonworks Senior Product Manager & PM for Apache Falcon & Apache Storm in Hortonworks Data Platform
Venkatesh Seetharam
Foundational Hadoop Architect, Engineer & Committer for Apache Falcon and Apache Knox Gateway projects
Agenda
•  Why You Need Apache Falcon
•  Key New Falcon Features
•  Demo
–  Defining data pipelines
–  Policies for retention
–  Managing Falcon server with Apache Ambari
A Modern Data Architecture
•  SOURCES: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data
•  DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP) alongside ENTERPRISE HADOOP (Governance & Integration, Security, Operations, Data Access, Data Management)
•  APPLICATIONS: business analytics, custom applications, packaged applications
•  OPERATIONS TOOLS: provision, manage & monitor
•  DEV & DATA TOOLS: build & test
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
•  GOVERNANCE & INTEGRATION: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
•  DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (in-memory analytics, ISV engines)
•  DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System) across nodes 1 to N
•  SECURITY: authentication, authorization, accounting, data protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox)
•  OPERATIONS: provision, manage & monitor (Ambari, Zookeeper); scheduling (Oozie)
Outline
•  Falcon Overview
•  Features
•  Architecture & Demo
Simple Data Pipeline in Hadoop
•  Simple data pipeline: Data Sources -> Raw Data -> Clean Data -> Prepped Data -> BI Tools, with MR/Pig/Hive jobs moving data between stages in the HDFS data lake
•  Has a relatively simple Oozie workflow (Job1, Job2, Job3 … JobN)
Quickly Gets Complicated…
Typical data governance requirements
Data stewards
•  Impact analysis
•  Monitor pipeline
•  Track ownership
•  Late data & failure handling
Compliance teams
•  Audit
•  Retention
•  Eviction
IT admins
•  Monitor infra
•  Replication
•  Archival
Business & data analysts
•  Verify data quality
[Diagram: to meet these requirements for a Raw -> Clean -> Prep pipeline, teams manually write & wire multiple complex Oozie workflows and other Hadoop tools, e.g. DistCp.]
Apache Falcon to the Rescue
Falcon adds the required data governance features
•  DEFINITION: Replication | Retention | Eviction | Late data
•  MONITORING
•  TRACING: Audit | Lineage | Tagging
[Diagram: the data pipeline (Raw -> Clean -> Prep) is defined in Falcon, which auto-generates & orchestrates the multiple complex Oozie workflows and other Hadoop ecosystem tools, e.g. DistCp.]
Falcon Basic Concepts
•  Cluster: Represents the “interfaces” to a Hadoop cluster
•  Feed: Defines a “dataset” (feeds are a.k.a. datasets)
•  Process: Consumes feeds, invokes processing logic & produces feeds
All these put together represent “Data Pipelines” in Hadoop
[Diagram: a FEED (a.k.a. DATASET) is INPUT TO a PROCESS, which CREATES a FEED; feeds and processes are tied to a CLUSTER.]
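The three concepts can be modeled in a few lines of Python. This is only an illustrative sketch: the class names, field names, and the `pipeline_feeds` helper are hypothetical and are not part of Falcon, which expresses these entities as XML specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Stands in for the interfaces to one Hadoop cluster."""
    name: str


@dataclass(frozen=True)
class Feed:
    """A dataset, stored on one or more clusters."""
    name: str
    clusters: tuple


@dataclass(frozen=True)
class Process:
    """Consumes input feeds, runs processing logic, produces output feeds."""
    name: str
    cluster: Cluster
    inputs: tuple
    outputs: tuple


def pipeline_feeds(processes):
    """Return the names of all feeds the processes touch, in order of first use."""
    seen = []
    for process in processes:
        for feed in process.inputs + process.outputs:
            if feed.name not in seen:
                seen.append(feed.name)
    return seen


# Hypothetical raw -> clean -> prepped pipeline on a single cluster.
primary = Cluster("primary")
raw = Feed("raw", (primary,))
clean = Feed("clean", (primary,))
prepped = Feed("prepped", (primary,))

cleanse = Process("cleanse", primary, inputs=(raw,), outputs=(clean,))
prepare = Process("prepare", primary, inputs=(clean,), outputs=(prepped,))

print(pipeline_feeds([cleanse, prepare]))  # ['raw', 'clean', 'prepped']
```

Because the output feed of one process is the input feed of the next, linking entities by reference is enough to recover the whole pipeline.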
Data Pipeline Definition
XML based pipeline specification
•  Modular: clusters, feeds & processes are defined separately and then linked together
•  Easy to re-use across multiple pipelines
Out of the box policies
•  Predefined policies for replication, retention & late data handling
•  Easy customization of policies
Extensible
•  Plug in external solutions at any step of the pipeline
•  E.g. invoke third-party data obfuscation components
Replication & Retention
•  Staged Data: retain 5 years
•  Cleansed Data: retain 3 years
•  Conformed Data: retain 3 years
•  Presented Data: retain last copy only
•  Sophisticated retention policies expressed in one place
•  Simplify data retention for audit, compliance, or for data re-processing
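To make the retention figures concrete, the sketch below computes which dataset instances fall outside a stage's retention period. It is not how Falcon implements eviction; the stage keys, function names, and the 365-days-per-year approximation are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Retention periods from the slide, approximated as 365 days per year (assumption).
RETENTION_DAYS = {
    "staged": 5 * 365,
    "cleansed": 3 * 365,
    "conformed": 3 * 365,
}


def instances_to_evict(stage, instance_times, now):
    """Return the instance timestamps older than the stage's retention period."""
    cutoff = now - timedelta(days=RETENTION_DAYS[stage])
    return [t for t in instance_times if t < cutoff]


def all_but_last_copy(instance_times):
    """'Retain last copy only': every instance except the newest is evictable."""
    return sorted(instance_times)[:-1]


now = datetime(2014, 5, 21)
cleansed = [datetime(2010, 1, 1), datetime(2012, 1, 1), datetime(2014, 1, 1)]

print(instances_to_evict("cleansed", cleansed, now))  # only the 2010 instance
print(all_but_last_copy(cleansed))                    # the 2010 and 2012 instances
```

Keeping the periods in one table mirrors the slide's point that retention policy is expressed in one place rather than scattered across jobs.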
Data Pipeline Monitoring
Centralized monitoring of data pipeline with Falcon + Ambari
•  Pipeline run alerts
•  Pipeline run history
•  Pipeline scheduling
[Diagram: the raw -> clean -> prep pipeline runs on Hadoop Cluster-1 (primary site) and on Hadoop Cluster-2 (DR site), with DATA flowing between the two sites.]
Data Pipeline Tracing
Data pipeline dependencies
•  View dependencies between clusters, datasets and processes (example: Purchase, Customer, Product and Store feeds)
Data pipeline tagging
•  Add arbitrary tags to feeds & processes (example: a Credit feed tagged “Sensitive” and “encrypted”)
Data pipeline audits
•  Know who modified a dataset, when, and into what
Data pipeline lineage
•  Analyze how a dataset reached a particular state (example: File-1, File-2, File-3)
Falcon User Flow
Define pipeline
•  Create cluster entity & process XML specifications
•  Validate and save specifications to HDFS
Deploy pipeline
•  Kick off feeds & processes
•  Schedule “instances” of feeds & processes to run
Manage pipeline
•  Ensure feeds & processes run as expected
•  Update feeds & processes as needed
•  Suspend, resume, or kill an “instance”
[Diagram: the User works through the Falcon CLI or API against the Falcon Server, using SUBMIT and SCHEDULE operations.]
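The define, deploy, and manage steps can be sketched as a small Python helper that assembles Falcon CLI argument lists without running them. The `falcon entity` and `falcon instance` flags reflect the Falcon command line as best recalled and should be checked against the help output of the installed version; the entity names, file names, and the `deployment_commands` helper are hypothetical.

```python
# Dependencies are submitted first: clusters, then feeds, then processes.
SUBMIT_ORDER = {"cluster": 0, "feed": 1, "process": 2}


def submit(entity_type, spec_file):
    """Arguments to submit an XML entity specification for validation and storage."""
    return ["falcon", "entity", "-type", entity_type, "-submit", "-file", spec_file]


def schedule(entity_type, name):
    """Arguments to schedule a previously submitted feed or process."""
    return ["falcon", "entity", "-type", entity_type, "-name", name, "-schedule"]


def instance_action(action, process_name, start, end):
    """Arguments to suspend, resume, or kill process instances in a time range."""
    if action not in ("suspend", "resume", "kill"):
        raise ValueError("unsupported instance action: " + action)
    return ["falcon", "instance", "-type", "process", "-name", process_name,
            "-" + action, "-start", start, "-end", end]


def deployment_commands(entities):
    """Submit every entity in dependency order, then schedule feeds and processes."""
    ordered = sorted(entities, key=lambda entity: SUBMIT_ORDER[entity[0]])
    commands = [submit(kind, spec) for kind, _, spec in ordered]
    commands += [schedule(kind, name) for kind, name, _ in ordered if kind != "cluster"]
    return commands


# Hypothetical entities given as (type, name, specification file).
entities = [
    ("process", "clickEnrichment", "click-enrichment.xml"),
    ("feed", "clicks", "clicks.xml"),
    ("cluster", "oregon", "oregon.xml"),
]

for command in deployment_commands(entities):
    print(" ".join(command))
```

Returning argument lists rather than executing them keeps the sketch side-effect free; a list could be passed to `subprocess.run` on a host where the Falcon client is installed.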
Falcon Architecture
Centralized Falcon Orchestration Framework
[Diagram: data stewards + Hadoop admins reach the Falcon Server through its API & UI and through AMBARI. The Falcon Server keeps entity specs in HDFS / Hive, hands scheduled jobs to Oozie, and receives process status over JMS. Oozie drives the Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCp.]
Clickstream enrichment data pipeline
Use case description
•  Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../{date}).
•  Cluster is located in the Oregon data center.
•  Data arrives from all NA-west-coast production servers.
•  The input data feeds are often late by up to 4 hrs.
•  We need to enrich the clickstream data with ad impression metadata and make it available to our marketing data science team for customer segmentation analysis.
•  Primary Hadoop cluster does not need the raw and enriched click data after 3 months.
•  Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
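The requirements above map fairly directly onto feed specifications. The sketch below uses Python's standard `xml.etree.ElementTree` to generate two simplified feed definitions. Element and attribute names follow the Falcon feed schema as best recalled and should be verified against the entity specification for the Falcon version in use; the feed names, cluster names, validity dates, and HDFS paths are hypothetical placeholders, and elements a complete specification needs (such as ACL and schema) are omitted.

```python
import xml.etree.ElementTree as ET


def build_feed(name, path, clusters, late_cutoff=None):
    """Build a simplified feed element.

    clusters is a list of (cluster name, "source" or "target", retention limit).
    """
    feed = ET.Element("feed", {"name": name, "xmlns": "uri:falcon:feed:0.1"})
    ET.SubElement(feed, "frequency").text = "hours(1)"  # data lands hourly
    if late_cutoff is not None:
        ET.SubElement(feed, "late-arrival", {"cut-off": late_cutoff})
    cluster_list = ET.SubElement(feed, "clusters")
    for cluster_name, role, limit in clusters:
        cluster = ET.SubElement(cluster_list, "cluster",
                                {"name": cluster_name, "type": role})
        # Hypothetical validity window.
        ET.SubElement(cluster, "validity",
                      {"start": "2014-05-01T00:00Z", "end": "2017-05-01T00:00Z"})
        ET.SubElement(cluster, "retention", {"limit": limit, "action": "delete"})
    locations = ET.SubElement(feed, "locations")
    ET.SubElement(locations, "location", {"type": "data", "path": path})
    return feed


# Input feed: tolerate data arriving up to 4 hours late, keep 3 months in Oregon.
clicks = build_feed(
    "clicks",
    "/data/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}",  # hypothetical path
    [("oregon", "source", "months(3)")],
    late_cutoff="hours(4)",
)

# Enriched feed: 3 months in Oregon, replicated to Virginia and kept for 3 years.
enriched = build_feed(
    "enrichedClicks",
    "/data/enriched-clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}",  # hypothetical path
    [("oregon", "source", "months(3)"), ("virginia", "target", "months(36)")],
)

print(ET.tostring(clicks, encoding="unicode"))
print(ET.tostring(enriched, encoding="unicode"))
```

Listing a second cluster of type `target` is what expresses the backup requirement in this sketch: the retention limit is set per cluster, so the primary and backup copies can age out on different schedules.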
Falcon Entity Relationships
CLICKSTREAM ENRICHMENT PIPELINE
•  Processes: Clicks ingest, Impressions ingest, Click enrichment
•  Datasets: Clicks, Impressions, Enriched clicks
•  Clusters: Oregon Hadoop cluster (PRIMARY CLUSTER), Virginia Hadoop cluster (BACKUP CLUSTER)
•  Relationships shown: processes create datasets and run on a cluster; datasets are stored on the primary cluster and backed up to the backup cluster
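These relationships form a dependency graph, which is what makes impact analysis possible. The sketch below walks such a graph breadth-first. The entity names are hypothetical identifiers, and the edges into the enrichment process are inferred from the use case description rather than read off the diagram.

```python
from collections import deque

# Edges point from an entity to the entities that depend on it (assumed edges).
DEPENDENTS = {
    "clicks-ingest": ["clicks"],
    "impressions-ingest": ["impressions"],
    "clicks": ["click-enrichment"],
    "impressions": ["click-enrichment"],
    "click-enrichment": ["enriched-clicks"],
    "enriched-clicks": [],
}


def downstream(entity):
    """Breadth-first walk returning every entity affected by a change to `entity`."""
    seen = []
    queue = deque(DEPENDENTS.get(entity, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(DEPENDENTS.get(current, []))
    return seen


print(downstream("impressions"))    # ['click-enrichment', 'enriched-clicks']
print(downstream("clicks-ingest"))  # ['clicks', 'click-enrichment', 'enriched-clicks']
```

Reversing the edges and running the same walk would answer the lineage question instead: which upstream entities contributed to a given dataset.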
Learn More About Data Governance in Hadoop
Hortonworks.com/labs/data-management/
Register for the remaining 4 Discover HDP 2.1 Webinars
Hortonworks.com/webinars
Next Webinar: Apache Hadoop 2.4.0, YARN and HDFS
Wednesday, May 28, 9am Pacific
Thank you!

