Apache Atlas:
Why Big Data Management
Requires Hierarchical Taxonomies
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Disclaimer
This document may contain product features and technology directions that are under development,
may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software
Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from
inception to release through Apache, however, technical feasibility, market demand, user feedback and
the overarching Apache Software Foundation community development process can all affect timing
and final delivery.
This document’s description of these features and technology directions does not represent a
contractual commitment, promise or obligation from Hortonworks to deliver these features in any
generally available product.
Product features and technology directions are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not
rely upon it when making purchasing decisions.
Speakers
Andrew Ahn
Governance Director
Product Management
Agenda
• Atlas Overview
• Near term roadmap
• Taxonomy Benefits
• Questions
Apache Atlas Overview
DGI* Community becomes Apache Atlas
Dec 2014: DGI group kickoff
Feb 2015: Prototype built
May 2015: Apache Atlas incubation
July 2015: HDP 2.3 Foundation GA release
First kickoff to GA in 7 months
Global Financial Company
Faster & safer: co-development driven by customer use cases
* DGI: Data Governance Initiative
Vision - Enterprise Data Governance Across Platforms
[Diagram labels: Structured; Unstructured; Traditional RDBMS; MPP Appliances; Metadata; Projects 1, 3, 4, 5, 6; Data Lake]
Atlas: Metadata Truth in Hadoop
Data Management: along the entire data lifecycle with integrated provenance and lineage capability
Modeling with Metadata: enables comprehensive data lineage through a hybrid approach with enhanced tagging and attribute capabilities
Interoperable Solutions: across the Hadoop ecosystem, through a common metadata store
Apache Atlas: Metadata Services
• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP
• Business taxonomy based classification: conceptual, logical and technical
[Diagram labels: Apache Atlas; Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi]
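Cross-component dataset lineage can be pictured as a directed graph whose edges record which process turned which input dataset into which output. The sketch below is only an illustration of that idea, not the Atlas API: the dataset names, the process names and the `upstream` helper are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: (input dataset, process, output dataset).
# The names are illustrative only; they are not taken from Atlas.
LINEAGE_EDGES = [
    ("rdbms.hr.drivers", "sqoop_import", "hive.raw.drivers"),
    ("kafka.timesheets", "storm_topology", "hive.raw.timesheets"),
    ("hive.raw.drivers", "hive_ctas", "hive.curated.driver_hours"),
    ("hive.raw.timesheets", "hive_ctas", "hive.curated.driver_hours"),
]


def upstream(dataset, edges=LINEAGE_EDGES):
    """Return every dataset that feeds `dataset`, directly or indirectly."""
    # Index the edges by target so each lookup is a dictionary access.
    parents = {}
    for source, _process, target in edges:
        parents.setdefault(target, set()).add(source)

    # Breadth-first walk from the dataset back towards its sources.
    seen = set()
    pending = deque([dataset])
    while pending:
        current = pending.popleft()
        for source in parents.get(current, ()):
            if source not in seen:
                seen.add(source)
                pending.append(source)
    return seen
```

Calling `upstream("hive.curated.driver_hours")` walks back across the Hive, Sqoop and Storm steps in one query, which is the practical benefit of keeping all component metadata in one place.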
Big Data Management Through Metadata
Management Scalability
Many traditional tools and patterns do not scale when applied to multi-tenant data lakes. Many enterprises have siloed data and metadata stores that collide in the data lake. This is compounded by the ability to keep very large time windows (years) of data. Can traditional EDW tools manage 100 million entities effectively with room to grow?
Metadata Tools
Scalable, decoupled, decentralized management driven through metadata is the only viable solution. This allows quick integration with automation and other metamodels.
Tags for Management, Discovery and Security
Proper metadata is the foundation for business taxonomy, stewardship, attribute-based security and self-service.
Apache Atlas High Level Architecture
[Diagram labels: Type System; Repository; Search DSL; Bridge; Hive, Storm, Falcon, Others; REST API; Graph DB; Search; Kafka; Sqoop; Connectors; Messaging Framework]
Taxonomy Benefits:
• Discovery – Business catalog of conceptual, logical and physical assets
• Security – Dynamic, metadata-based access control
Near Term Roadmap:
Summer 2016
Expanded Native Connector: Dataset Lineage
[Diagram labels: Sqoop Teradata Connector; Apache Kafka; Custom Activity Reporter; Metadata Repository; RDBMS]
Business Catalog
[Screenshot callouts: breadcrumbs for taxonomy context path; contents at taxonomy context]
Technical and Logical Metadata Exchange
[Diagram labels: Knowledge Store; Atlas REST API; Structured; Unstructured; Files: XML / JSON; 3rd Party Vendors; Custom Reporter; Non-Hadoop Taxonomy; Data Lineage; Technical Metadata]
Governance Ready Certification Program
[Diagram labels: Discovery; Tagging; Prep / Cleanse; ETL; Governance; BPM; Self Service; Visualization]
Curated: Selected group of vendor partners to provide rich, complementary and complete features
Choice: Customers choose the features they want to deploy, à la carte versus vendor lock-in
Agile: Low switching costs, faster deployment and innovation
Standard: Common SLA and common open metadata store
Flexibility: Interoperability of products through Atlas metadata
HDP at core to provide stability and interoperability
Business Taxonomy Inheritance
[Diagram: in the logical business taxonomy, the parent Human Resources carries the PII tag; its children Drivers (Dimension) and Timesheets (Facts) each carry PII as well. The taxonomy sits above the data assets.]
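Inheritance in a hierarchical taxonomy can be sketched as a walk from a term up to the root, collecting tags on the way. This is a minimal illustration under stated assumptions, not how Atlas stores its taxonomy: the `PARENT` and `DIRECT_TAGS` tables, the "Finance" term and both helper functions are hypothetical.

```python
# Hypothetical taxonomy: child term -> parent term (None marks the root).
PARENT = {
    "Company": None,
    "HR": "Company",
    "Drivers": "HR",
    "Timesheets": "HR",
    "Finance": "Company",
}

# Tags a data steward attached directly to a term.
DIRECT_TAGS = {"HR": {"PII"}}


def effective_tags(term):
    """Collect the tags on `term` plus those inherited from its ancestors."""
    tags = set()
    while term is not None:
        tags |= DIRECT_TAGS.get(term, set())
        term = PARENT[term]
    return tags


def path(term):
    """Return the breadcrumb path from the root down to `term`."""
    parts = []
    while term is not None:
        parts.append(term)
        term = PARENT[term]
    return " > ".join(reversed(parts))
```

Tagging HR once is enough for `effective_tags("Drivers")` and `effective_tags("Timesheets")` to include PII, while a sibling branch such as the hypothetical Finance term stays untagged. The steward maintains one tag, not one per asset.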
Dynamic Access Policy
Apache Ranger + Atlas Integration
Dynamic Access Policy Driven by metadata
Tag-based Access Policy Requirements
• Basic tag policy – PII example. Access and entitlements must be tag-based (ABAC) and scalable in implementation.
• Geo-based policy – Policy based on IP address; proxy IP substitution may be required. Rule enforcement must be geo-aware.
• Time-based policy – Timer for data access, decoupled from deletion of data.
• Prohibitions – Prevention of combinations of Hive tables that may pose a risk together.
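The four requirements above can be combined into one access check. The sketch below is an illustration only and does not reflect Ranger's policy model: `TagPolicy`, `PROHIBITED_COMBINATIONS`, the "HEALTH" tag and `is_allowed` are hypothetical. It assumes the caller has already resolved the client IP address to a country code, and it allows untagged assets by default as a simplification.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TagPolicy:
    tag: str
    allowed_groups: frozenset
    allowed_countries: Optional[frozenset] = None  # None means no geo limit
    expires_at: Optional[datetime] = None  # None means no time limit


# Tag sets that must not appear together in one request (prohibition).
PROHIBITED_COMBINATIONS = [frozenset({"PII", "HEALTH"})]


def is_allowed(policies, asset_tags, group, country, now):
    """Check a request against the tags of all assets it touches.

    `asset_tags` is the union of tags across every table in the request.
    """
    tags = set(asset_tags)
    # Prohibition: deny when the request combines tags that are risky together.
    if any(combo <= tags for combo in PROHIBITED_COMBINATIONS):
        return False
    for policy in policies:
        if policy.tag not in tags:
            continue  # this policy does not apply to the request
        if group not in policy.allowed_groups:
            return False  # basic tag policy
        if (policy.allowed_countries is not None
                and country not in policy.allowed_countries):
            return False  # geo-based policy
        if policy.expires_at is not None and now >= policy.expires_at:
            return False  # time-based policy: access lapses, data remains
    return True
```

Note that the time-based check only withdraws access; nothing is deleted, which matches the requirement that the timer be decoupled from data deletion.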
How does Atlas work with Ranger at scale?
Atlas provides: Metadata
• Business classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects: the sensitive “PII” tag of department HR will be inherited by group HR > Driver
• Atlas will notify Ranger of changes via a Kafka topic
[Diagram labels: Apache Atlas; Hive, Ranger, Falcon, Kafka, Storm. Atlas provides the metadata tag to create policies]
Ranger provides: Access & Entitlements
• Ranger will cache tags and asset mappings for performance
• Ranger will have policies based on tags instead of roles
• Example: PII = <group>. This can work for many assets.
Use cases drive design – high reliability
[Diagram: Atlas (metastore of tags, assets and entities; notification framework; Kafka topics) sends metadata update notifications to Ranger (Atlas client that subscribes to the topic and gets metadata updates; PDP; resource cache). Callouts: message durability; optimized for speed; event-driven updates]
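The event-driven design can be sketched with an in-memory queue standing in for the Kafka topic. This is a simulation under stated assumptions, not the Atlas or Ranger implementation: `publish_tag_change`, `TagCache` and the message shape are hypothetical, and a real deployment would use a Kafka client where this sketch uses `queue.Queue`.

```python
import queue

# Stand-in for a Kafka topic; a real deployment would use a Kafka client.
topic = queue.Queue()


def publish_tag_change(asset, tags):
    """Atlas side (simulated): announce the current tag set for an asset."""
    topic.put({"asset": asset, "tags": sorted(tags)})


class TagCache:
    """Ranger side (simulated): local asset-to-tags map used at decision time."""

    def __init__(self):
        self._tags_by_asset = {}

    def drain(self, source):
        """Apply every pending notification and return how many were applied."""
        applied = 0
        while True:
            try:
                event = source.get_nowait()
            except queue.Empty:
                return applied
            # Each message carries the full tag set, so the latest one wins.
            self._tags_by_asset[event["asset"]] = set(event["tags"])
            applied += 1

    def tags_for(self, asset):
        # Lookups are served from memory and never call the metadata store.
        return self._tags_by_asset.get(asset, set())
```

After `publish_tag_change("hive.hr.drivers", {"PII"})` and a `drain(topic)`, the cache answers `tags_for("hive.hr.drivers")` locally. Policy decisions stay fast because they never wait on the metadata store, and a durable topic lets a restarted consumer catch up on missed updates.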
Preview Demo
• Security
• Discovery & Lineage
Availability:
- Tech Preview VMs: May 2016
- GA Release: Summer 2016
Questions?
Reference
Online Resources
VM (download public preview VM): https://s3.amazonaws.com/demo-drops.hortonworks.com/HDP-Atlas-Ranger-TP.ova
Tutorial: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview
Blog: http://hwxjojo.wpengine.com/blog/the-next-generation-of-hadoop-based-security-data-governance/ (this is giving an error right now)
Learn More: http://hortonworks.com/solutions/atlas-ranger-integration/
Tag Based Security Video:
https://drive.google.com/file/d/0B0wjjMSH77srLXFZN3lmWHVJWVU/view?usp=sharing
Thank You
HDF: Dataflow Governance Solution
Dataflow Security Use Case Requirements
Accelerated Data Collection: An integrated, data source agnostic collection platform
Increased Security and Unprecedented Chain of Custody: Secure from source to storage with high fidelity data provenance
The Internet of Any Thing (IoAT): A Proven Platform for the Internet of Things
http://hortonworks.com/hdf/
Enterprise Grade Governance Dataflow Solution
HDF – NiFi: operation control; maximum fidelity; event level
• Guaranteed delivery
• Data buffering
• Prioritized queueing
• Flow-specific QoS
• Visual command & control
HDP – Atlas: governance management; medium / low fidelity; dataset level
• HDP taxonomy
• Centralized metadata repository
• Downstream HDP impacts
• Cross-component lineage
• 3rd party integration
[Other diagram labels: Filtered Metadata; Months Lineage; Years Lineage; Reference Taxonomy (Tags); Event level versus Dataset level]
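The difference between event-level and dataset-level fidelity can be illustrated by collapsing individual provenance events into one lineage edge per dataset pair. This is a sketch under stated assumptions rather than NiFi's or Atlas's data model: the event records, their field names and `dataset_level_edges` are hypothetical.

```python
from collections import Counter

# Hypothetical event-level provenance records: one per item moved.
EVENTS = [
    {"source": "sensor-feed", "target": "kafka.readings", "bytes": 120},
    {"source": "sensor-feed", "target": "kafka.readings", "bytes": 95},
    {"source": "kafka.readings", "target": "hive.raw.readings", "bytes": 215},
]


def dataset_level_edges(events):
    """Collapse per-event records into one edge per (source, target) pair.

    The value is the number of events behind each edge; per-event detail
    such as the byte counts is dropped on purpose.
    """
    counts = Counter((event["source"], event["target"]) for event in events)
    return dict(counts)
```

The result keeps which dataset feeds which, but not what happened to each individual item. That loss of detail is the trade-off the slide describes: the dataset-level view is much smaller than the event-level record it is derived from.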
Expanded visibility throughout the eco-system
[Diagram labels: HDF; HDP; ETL; Security Appliance (several); NiFi (several); Kafka; Hive with Hive Hook (Native); Data; Metadata; Atlas Metadata Repository. Callouts: centralized repository for multiple NiFi deployments; end-to-end data lineage]


Editor's Notes

  • #2: TALK TRACK Data is powering successful clinical care and successful operations. [NEXT SLIDE]
  • #7: How fast? 7 months!
  • #11: Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the Data Governance Initiative and others from the Hadoop community. The core functionality defined by the project includes the following: Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources; Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop; Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed; Security and Policy Engine – implement engines to protect and rationalize data access according to compliance policy.
  • #14: Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into hadoop – keep it together
  • #16: Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into hadoop – keep it together
  • #17: Which vendors would you be interested in?
  • #21: The point of Atlas is to leverage metadata to drive exchange, agility and scalability in the HDP governance solution. The paradigm shift is that in a true multi-tenant data lake with 10K+ objects, conventional management of entitlement and enforcement will not work and new patterns must be used. One group cannot both understand the data and manage policy efficiently; the domain is too large. These activities must be decoupled. The data stewards curate the data, as they are the SMEs (tagging), and the policy folks create a policy once based on tags (access rules). In our thinking, this is the ONLY scalable solution. We have it and CDH does not.
  • #22: Apache Atlas = low-level service like YARN. It will be common to the whole HDP platform, providing core metadata services and enriching the whole HDP stack. We start with Hive in HDP 2.3 and will extend to Ranger and Falcon in M10 and continue with Kafka and Storm by the end of 2015. Yellow + Atlas = governance features.
  • #23: Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into hadoop – keep it together
  • #34: Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into hadoop – keep it together