1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Enterprise Data Classification and Provenance
Apache Atlas
Shwetha Shivalingamurthy
Suma Shivaprasad
Disclaimer
This document may contain product features and technology directions that are under development, may be
under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation
project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release
through Apache, however, technical feasibility, market demand, user feedback and the overarching Apache
Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual
commitment, promise or obligation from Hortonworks to deliver these features in any generally available
product.
Product features and technology directions are subject to change, and must not be included in contracts,
purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it
when making purchasing decisions.
Agenda
• Demo
• Big Data Governance
• Overview of Atlas
• Atlas architecture
• Features and Roadmap
Demo use case – Ad network
• Matches advertiser demand with ad space supply from publishers
• Billing based on ad impressions/ad engagement
• Enables targeting, tracking and reporting of ad impressions
• Typical reports/queries:
• Mismatch of demand and supply
• Country-wise and OS-wise reports
• Top advertisers/publishers
Data landscape
[Figure labels: Ad servers; User, Ad Impression, Click and Billing logs; Metadata; Summaries; Traditional warehouse]
Data governance requirements
• Cross platform lineage – impact analysis, forensic, discovery
• Asset search
• Common Business Terms
• Compliance
Demo
• Technical and business metadata
• Cross Component Lineage
• Creating views
• Create tags
• Entity deletes
• Search using tags, attributes
• Entity audit
• Business catalog – find assets
• Flexible model, external lineage ingest
HDP 2.5
Data Governance Aspects
Data governance refers to the processes, methods and tools used in an enterprise for effective control of the availability, usability, integrity, and security of data.
Aspects shown on the slide:
• Data Discovery and Tagging
• Metadata Management
• Data Lineage/Provenance
• Access Management
• Data Security & Privacy
• Data Quality
• Compliance and Audit
• Data Wrangling
• Data Lifecycle Management
• Data Integration
Enterprise Data Governance: Apache Atlas
Data Management along the entire data lifecycle, with integrated provenance and lineage capability
• Cross component lineage
Modeling with Metadata enables a comprehensive business metadata vocabulary with enhanced tagging and attribute capabilities
• Common Business Language
• Hierarchically organized – no duplicates
Interoperable Solutions across the Hadoop ecosystem, through a common metadata store
• Combine and Exchange Metadata
[Figure: Atlas metadata at the center, connected to Hive, Sqoop, Falcon, Kafka, Storm, Ranger, Custom and Partners, and to structured sources: traditional RDBMS metadata and MPP appliances]
Background: DGI Community becomes Apache Atlas
Timeline:
• Dec 2014 – DGI group kickoff
• May 2015 – Apache Atlas incubation
• Jul 2015 – HDP 2.3 / Apache Atlas 0.5 foundation release
• Aug 2016 – HDP 2.5 / Apache Atlas 0.7 release
* DGI: Data Governance Initiative
[Figure label: Global Financial Company]
Key Benefits:
• Co-Dev = Built for real customer use cases
• Faster & Safer = Customers know business + HWX knows Hadoop
• Code contributors: Hortonworks, IBM, Aetna, Merck, Target
Architecture
Atlas Type System
• Defines model – schema of metadata
• Flexible and powerful to define any model/custom types
• Supports inheritance
• Types
• Primitive types – bool, integer types, string, date, enum
• Collections - array, map
• Struct – set of attributes
• Class – Identifiable struct, hierarchy
• Trait – set of attributes, hierarchy
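The type categories above can be illustrated with a small sketch. This is not the Atlas API: `AttributeDef`, `TypeDef`, `all_attributes` and the `registry` dictionary are hypothetical names, assumed here only to show how class types, traits and inheritance of attributes could fit together.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AttributeDef:
    name: str
    type_name: str            # e.g. "string", "date", "array<hive_column>"
    required: bool = True


@dataclass
class TypeDef:
    name: str
    category: str             # "struct", "class" or "trait"
    attributes: list = field(default_factory=list)
    super_types: list = field(default_factory=list)


def all_attributes(type_def, registry):
    """Collect a type's attributes plus those inherited from its supertypes."""
    collected = []
    for parent in type_def.super_types:
        collected.extend(all_attributes(registry[parent], registry))
    collected.extend(type_def.attributes)
    return collected


# Hypothetical registry holding two class types and one trait.
registry = {
    "DataSet": TypeDef("DataSet", "class", [AttributeDef("name", "string")]),
    "hive_table": TypeDef(
        "hive_table", "class",
        [AttributeDef("createTime", "date")],
        super_types=["DataSet"],
    ),
    "PII": TypeDef("PII", "trait"),
}

table_attrs = [a.name for a in all_attributes(registry["hive_table"], registry)]
print(table_attrs)
```

In this sketch `hive_table` picks up `name` from `DataSet` through inheritance, while the `PII` trait carries no attributes of its own.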
Hive Model
DataSet (metaType: ClassType)
• name: string, required
hive_db (metaType: ClassType)
• name: string, required
• createTime: date, required
• parameters: map<string,string>, optional
hive_table (metaType: ClassType, extends DataSet)
• db: hive_db, required (references hive_db)
• createTime: date, required
• columns: array<hive_column>, required (references 0..n hive_column)
hive_column (metaType: ClassType)
• name: string, required
• type: string, required
Entities
Instances of types
• hive_db (Guid: 1) – name: rawlogs, createTime: 2015-01-01 10:00
• hive_table (Guid: 2) – name: impressions; db edge to Guid 1; column edges to Guid 3 and Guid 4
• hive_column (Guid: 3) – name: adv_id, type: string
• hive_column (Guid: 4) – name: user_id, type: string
Traits attached in the diagram: EXPIRES_ON (Time: March, 2016) and PII
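As a rough illustration of entities as instances of types, the sketch below rebuilds the slide's four entities in plain Python. The `Entity` class and `with_trait` helper are hypothetical, and which entity carries which trait is an assumption made for the example, since the extracted slide text does not say.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    guid: str
    type_name: str
    attributes: dict
    traits: dict = field(default_factory=dict)   # trait name -> trait attributes


entities = {
    "1": Entity("1", "hive_db", {"name": "rawlogs", "createTime": "2015-01-01 10:00"}),
    "2": Entity("2", "hive_table", {"name": "impressions", "db": "1", "columns": ["3", "4"]}),
    "3": Entity("3", "hive_column", {"name": "adv_id", "type": "string"}),
    "4": Entity("4", "hive_column", {"name": "user_id", "type": "string"}),
}

# Assumption for illustration only: the trait-to-entity assignment below
# is not recoverable from the slide text.
entities["4"].traits["PII"] = {}
entities["2"].traits["EXPIRES_ON"] = {"time": "March, 2016"}


def with_trait(trait_name):
    """Return the names of all entities that carry the given trait."""
    return sorted(e.attributes["name"] for e in entities.values() if trait_name in e.traits)


print(with_trait("PII"))
```

A trait is attached to an individual entity rather than to its type, which is what lets two columns of the same type differ in sensitivity.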
Graph Engine
• Graph Database
• Titan with storage backed by HBase
• Types and Entities are translated to the Graph Model
• Classes, Structs and Traits map to a vertex
• Relationships are mapped as edges
• Rich relationships between metadata objects
• Indexing and Search
• Indexing based on type annotations
• External indexing – Titan backed by Solr
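The translation of entities into a graph model can be sketched as follows. This does not use Titan; the `to_graph` function, the `{"ref": guid}` reference convention and the `__type` property name are assumptions chosen to show the idea that entities become vertices and references become labeled edges.

```python
def to_graph(entities):
    """Map each entity to a vertex and each reference attribute to labeled edges."""
    vertices, edges = {}, []
    for guid, entity in entities.items():
        props = {}
        for key, value in entity["attributes"].items():
            refs = value if isinstance(value, list) else [value]
            is_reference = bool(refs) and all(
                isinstance(r, dict) and "ref" in r for r in refs
            )
            if is_reference:
                # One edge per referenced entity, labeled with the attribute name.
                for r in refs:
                    edges.append((guid, key, r["ref"]))
            else:
                props[key] = value
        vertices[guid] = {"__type": entity["type"], **props}
    return vertices, edges


entities = {
    "1": {"type": "hive_db", "attributes": {"name": "rawlogs"}},
    "2": {"type": "hive_table", "attributes": {
        "name": "impressions",
        "db": {"ref": "1"},
        "columns": [{"ref": "3"}, {"ref": "4"}],
    }},
    "3": {"type": "hive_column", "attributes": {"name": "adv_id", "type": "string"}},
    "4": {"type": "hive_column", "attributes": {"name": "user_id", "type": "string"}},
}

vertices, edges = to_graph(entities)
for edge in edges:
    print(edge)
```

Primitive attributes stay on the vertex as properties, so only relationships between metadata objects produce edges.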
Titan property graph model
Graph Search with Gremlin
saturn = g.V.has('name','saturn').next()
hercules = saturn.as('x').in('father').loop('x'){it.loops < 3}.next()
hercules.outE('battled').has('time', T.gt, 1).inV.name
==>cerberus
==>hydra
Search
Find relevant assets based on their attributes and associations with business terms
DSL with SQL-like syntax based on the type system
from $type is $trait where $clause select|has $attributes, repeat
Examples
• Select columns from a hive_table where its name is "impressions" and db name is "raw"
hive_column where table.name="impressions", table.db.name="raw"
• Select all columns from hive tables which are tagged as "PII"
hive_column is 'PII'
Full text search
'(rawlogs) AND hive'
'(rawlogs OR supply*) AND hive'
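To make the meaning of the two DSL examples concrete, here is a minimal sketch that evaluates equivalent filters over in-memory dictionaries. It does not parse or call the Atlas DSL; `resolve`, `search` and the sample data (including the `clicks` table) are hypothetical and exist only for illustration.

```python
# Hypothetical sample metadata: one database, two tables, three columns.
db = {"type": "hive_db", "name": "raw"}
impressions = {"type": "hive_table", "name": "impressions", "db": db}
clicks = {"type": "hive_table", "name": "clicks", "db": db}
columns = [
    {"type": "hive_column", "name": "adv_id", "table": impressions, "traits": set()},
    {"type": "hive_column", "name": "user_id", "table": impressions, "traits": {"PII"}},
    {"type": "hive_column", "name": "click_ts", "table": clicks, "traits": set()},
]


def resolve(entity, path):
    """Follow a dotted attribute path such as 'table.db.name'."""
    for part in path.split("."):
        entity = entity[part]
    return entity


def search(entities, type_name, trait=None, where=None):
    """Filter by type, optional trait, and optional attribute-path equality."""
    where = where or {}
    return [
        e["name"] for e in entities
        if e["type"] == type_name
        and (trait is None or trait in e["traits"])
        and all(resolve(e, path) == value for path, value in where.items())
    ]


# hive_column where table.name="impressions", table.db.name="raw"
in_impressions = search(columns, "hive_column",
                        where={"table.name": "impressions", "table.db.name": "raw"})
# hive_column is 'PII'
pii_columns = search(columns, "hive_column", trait="PII")
print(in_impressions, pii_columns)
```

The dotted paths mirror how the DSL walks from a column to its table and then to the table's database.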
Features and Roadmap
Apache Atlas Component Integration & Lineage
• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP
[Figure: Apache Atlas integrations by release. Components shown: Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi, HBase, Partner, Custom. Release labels: HDP 2.3, HDP 2.5, Beyond HDP 2.5, External]
Business Catalog for Ease of Use
• Organize data assets along business terms
– Authoritative: Hierarchical Taxonomy Creation
– Agile modeling: Model Conceptual, Logical, Physical assets
– Definition and assignment of tags like PII (Personally Identifiable Information)
• Comprehensive features for compliance
– Multiple user profiles including Data Steward and Business Analysts
– Object auditing to track "who did it"
– Metadata versioning to track "what did they do"
• Faster Insight (Roadmap)
– Data Quality tab for profiling and sampling
– User Comments
Apache Ranger: Introduction
Centralized authorization and auditing across Hadoop components
• HDFS, Hive, HBase, Knox, Storm, YARN, Kafka, Solr, ..
• Audit logs to: Solr, HDFS, RDBMS, Log4j, ..
Resource based security
• Policies for specific set of resources
• Requires revision of policies as resources get added/moved
Classification based security
• Policies for classifications and not for specific resources
• A single policy protects resources in multiple components
• As the classification of a resource changes, the appropriate policies are applied automatically
• Enables separation of duties: resource-classification and security policies
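The difference between resource-based and classification-based security can be sketched in a few lines. This is a deliberately simplified model and not Ranger's policy engine: `resource_tags`, `tag_policies`, `is_allowed` and all resource and group names are assumptions for illustration.

```python
# Classification: which tags each (component, resource) pair carries.
resource_tags = {
    ("hive", "raw.impressions.user_id"): {"PII"},
    ("hdfs", "/data/raw/users"): {"PII"},
    ("kafka", "click-events"): set(),
}

# Policy: one entry per tag, listing the groups allowed to access tagged resources.
tag_policies = {"PII": {"us_hr"}}


def is_allowed(user_groups, component, resource):
    """Deny access if any tag on the resource has a policy excluding the user."""
    for tag in resource_tags.get((component, resource), set()):
        allowed_groups = tag_policies.get(tag)
        if allowed_groups is not None and not (allowed_groups & user_groups):
            return False
    return True


analyst = {"us_employee"}
print(is_allowed(analyst, "kafka", "click-events"))   # untagged topic

# A steward later classifies the topic; no policy change is needed.
resource_tags[("kafka", "click-events")].add("PII")
print(is_allowed(analyst, "kafka", "click-events"))
```

The single `PII` policy covers resources in Hive, HDFS and Kafka alike, and tagging is done separately from policy authoring, which reflects the separation of duties described above.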
Scalable Access Control – Reusable Tag Policy
Resource-based policy (not scalable):
• A single admin group assigns user groups (AD, Linux) to individual resources: files, tables, topologies
Tag-based policy (scalable):
• Many stewards tag assets; a policy on an Atlas tag such as PII covers ANY asset tagged PII: files, tables, topologies
• Single point of enforcement and audit
• All future tagging is covered by existing policy
Open: Governance Ready Certification Program
Choice: Customers choose the features that they want to deploy, à la carte, rather than vendor lock-in
Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy
Agile: Low switching costs, faster deployment and innovation
Centralized: Common SLA & common open metadata store
Flexibility: Interoperability of products through Atlas metadata
Safe: HDP at core to provide stability and interoperability
Completed:
• Waterline
• Dataguise
• Attivio
• Trifacta
Pending:
• Collibra
• Alation
• Meta Integration (Miti)
• Paxata
• Syncsort
• Talend
Roadmap…
• MultiTenancy
• Titan 1.x Migration
• Hive Column Level Lineage
Summary
• Designed for Hadoop at platform, not application level
• High Confidence data in Hadoop for regulated verticals
• Compliance and business objectives aligned to data organization
• Faster discovery for analysts – reduce time to value
• Agile and adaptable – native connectors ensure information is current
• Dynamic protection with Ranger in simple audited policies
Learn More:
• Apache Incubator link: http://atlas.incubator.apache.org/
• Hortonworks links: http://hortonworks.com/solutions/security-and-governance/
• https://community.hortonworks.com/spaces/64/governance-lifecycle-track.html?topics=Atlas&type=question
• Atlas Technical User Guide: http://atlas.incubator.apache.org/AtlasTechnicalUserGuide.pdf
Questions
Backup
Dynamic Access Policy
Apache Ranger + Atlas Integration
How does Atlas work with Ranger at scale?
Atlas provides: Metadata
• Business classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects: a sensitive "PII" tag on department HR is inherited by group HR > Driver
• Atlas notifies Ranger of changes via a Kafka topic
• Atlas provides the metadata tag used to create policies
[Figure: Apache Atlas connected to Hive, Ranger, Falcon, Kafka and Storm]
Ranger provides: Access & Entitlements
• Ranger caches tags and asset mappings for performance
• Ranger has policies based on tags instead of roles
• Example: PII = <group>. This can work for many assets.
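The inheritance of a tag down the business taxonomy can be sketched as a walk from a term up to the root. The `parent_of` and `direct_tags` mappings and the `effective_tags` helper are hypothetical; only the Company > HR > Driver hierarchy and the PII tag on HR come from the slide.

```python
# Taxonomy from the slide: Company > HR > Driver.
parent_of = {"Driver": "HR", "HR": "Company", "Company": None}

# Tags assigned directly to a taxonomy term.
direct_tags = {"HR": {"PII"}}


def effective_tags(term):
    """Union of the term's own tags and those of all its ancestors."""
    tags = set()
    while term is not None:
        tags |= direct_tags.get(term, set())
        term = parent_of[term]
    return tags


print(effective_tags("Driver"))
```

Tagging HR once is enough for every child term, while the parent Company stays untagged.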
Automatic update of policies – active protection
Atlas:
• Metastore holding tags, assets and entities
• Notification framework publishing to Kafka topics (message durability)
Ranger:
• Atlas client subscribes to the topic and gets metadata updates (event-driven updates)
• PDP with a resource cache (optimized for speed)
Flow: Atlas sends notifications of metadata updates to Ranger.
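The event-driven update path can be sketched with an in-process queue standing in for the Kafka topic. Nothing here uses the real Atlas or Ranger clients: `publish_tag_change`, `drain_updates`, the JSON message shape and the asset names are all assumptions for illustration.

```python
import json
import queue

topic = queue.Queue()   # stand-in for a Kafka topic
tag_cache = {}          # asset name -> set of tags, as a policy engine might cache it


def publish_tag_change(asset, tags):
    """Metadata side: emit a notification whenever an asset's tags change."""
    topic.put(json.dumps({"asset": asset, "tags": sorted(tags)}))


def drain_updates():
    """Policy side: apply all pending notifications to the local cache."""
    applied = 0
    while not topic.empty():
        event = json.loads(topic.get())
        tag_cache[event["asset"]] = set(event["tags"])
        applied += 1
    return applied


publish_tag_change("raw.impressions.user_id", {"PII"})
publish_tag_change("raw.tax_2010", {"EXPIRES_ON"})
applied = drain_updates()
print(applied, tag_cache)
```

Because authorization decisions read only the local cache, they stay fast, and the cache converges as notifications arrive.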
Apache Ranger: Authorization and Auditing
[Figure: Enterprise users access Hadoop components (HDFS, Hive Server2, HBase, Knox, Storm, YARN, Kafka, Solr), each with a Ranger plugin. Also shown: Ranger Administration Portal, Ranger Policy Store, Ranger Audit Store, and the labels HDFS, Solr, RDBMS and Log4j]
Big Data Governance
Current Landscape
• Data is opaque and spread across a variety of data stores – HDFS, S3, data warehouses
• Schema alone is hardly sufficient – Hive Metastore, Avro, data warehouse
• Platform tools like Ranger and Falcon solve parts of the problem
Need for Data Governance
Organizations need data governance to understand their information and answer questions such as:
• What do we know about our information?
• Where did this data come from and how is it being used?
• Does this data adhere to company policies and rules?
• Need for effective control and consumption of data
Atlas helps customers discover information about data objects: their meaning, location, characteristics, and usage.
Business Taxonomy
Business Taxonomy (Catalog)
Taxonomy is the practice and science of classifying things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.
Tags: Traits vs. Labels vs. Business Taxonomy
Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy: a PII tag can be used in HR as well as in Finance or Sales.
Benefits:
• A view of data assets organized by business language
• Compliance, acceptable use – dynamic metadata-based access control
• Common taxonomy across Hadoop components
Principal Roles & Activities in an Enterprise
• Data Steward – Curator, responsible for data classification: associates business taxonomy and tagging, access policies
• Data Scientist – Analyst, primary consumer of Business Taxonomy
• Administrator/Operations – Role management, data lifecycle management (archival, retention)
• Data Engineer – Data ingress and egress, semantic data quality
Callouts from the slide:
• 50% - 80%+ of time spent looking for data
• Profit center
• Primary user of Atlas
• Enables scientist
Goal: < 25% of time spent on finding data = empowering scientists to spend their time uncovering insights faster
Data Governance Use Cases: Impact analysis
• HortonAdNetwork – a large ad network with an international footprint and multiple publishers and advertisers across several countries
• Complex ETL jobs and data pipelines process real-time ad network data from several different sources on various data processing platforms
• No easy way to determine the root cause when something is off the charts
• Data analysts need effective data provenance tools for impact/root cause analysis
• Cross component lineage is a must
• Data Lineage (Provenance)
Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.
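Impact analysis and root cause analysis are both traversals of the lineage graph, in opposite directions. The sketch below assumes a hypothetical `derived_from` mapping with made-up dataset names; it is not how Atlas stores or queries lineage.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets derived from it.
derived_from = {
    "kafka:impressions": ["hive:raw.impressions"],
    "hive:raw.impressions": ["hive:summary.by_country", "hive:summary.top_advertisers"],
    "hive:summary.by_country": ["rdbms:warehouse.country_report"],
}


def downstream(dataset):
    """Impact analysis: everything affected if this dataset is wrong (breadth-first)."""
    seen, order, pending = {dataset}, [], deque([dataset])
    while pending:
        current = pending.popleft()
        for child in derived_from.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                pending.append(child)
    return order


def upstream(dataset):
    """Root cause analysis: every source this dataset was derived from, origin first."""
    result = []
    for source, outputs in derived_from.items():
        if dataset in outputs:
            result.extend(upstream(source))
            result.append(source)
    return result


print(downstream("hive:raw.impressions"))
print(upstream("rdbms:warehouse.country_report"))
```

Because the edges cross Kafka, Hive and an RDBMS, the traversal only gives a complete answer when lineage is captured across components.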
Data Governance Use Cases: Compliance
• HortoniaBank – mid-size bank expanding from the US to international markets
• 2 customer tables owned by BH: 50K customer records each, with 38 fields (PII, PHI, PCI & non-sensitive data)
– us_customers: USA person data only
– ww_customers: multi-language, multi-country, localized person data
• 1 data set of prospects leased from a data broker
– tax_2010: the data lease has already expired!
User | Group | Access Privileges
joe_analyst | us_employee | US data only, non-sensitive data only; rest forbidden depending on sensitivity
kate_hr | us_hr | US data only, all sensitive data (PCI, PII, PHI)
Tag Based Policies
• US HR team members can see all original data (PCI, PII, …)
• Analysts are prohibited from viewing PII data in any of the tables
• Everyone except Operations/Admin is prohibited from accessing tax_2010 after the specified date: the Expires_on policy turns off access on the configured expiry date
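The tag-based policies above can be sketched as a single access check. This is a table-level simplification rather than Ranger's evaluation logic: `table_tags`, `can_read`, the `admin` group name and the concrete expiry date are assumptions for illustration.

```python
from datetime import date

# Tags per table; the expiry date is an assumed value for the example.
table_tags = {
    "us_customers": {"PII": {}},
    "tax_2010": {"EXPIRES_ON": {"expiry": date(2016, 3, 1)}},
}


def can_read(groups, table, today):
    """Apply the expiry policy first, then the PII policy; untagged data is open."""
    tags = table_tags.get(table, {})
    if "EXPIRES_ON" in tags and today > tags["EXPIRES_ON"]["expiry"]:
        return "admin" in groups      # only operations/admin after expiry
    if "PII" in tags:
        return "us_hr" in groups      # only the HR group may see PII
    return True


print(can_read({"us_employee"}, "tax_2010", date(2016, 2, 1)))
print(can_read({"us_employee"}, "tax_2010", date(2016, 4, 1)))
```

The expiry check needs no scheduled job: access changes on the configured date because the date is compared at decision time.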
Expanded Native Connector: Dataset Lineage
[Figure: dataset lineage flows into the Metadata Repository through Sqoop (RDBMS, Teradata Connector), Apache Kafka, and a Custom Activity Reporter. Callouts: "Any process using Sqoop is covered"; "No other tool tracks IoT out of the box"]


Enterprise Data Classification and Provenance

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Enterprise Data Classification and Provenance Apache Atlas Shwetha Shivalingamurthy Suma Shivaprasad
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Disclaimer This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed. Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache, however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all effect timing and final delivery. This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product. Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda • Demo • Big Data Governance • Overview of Atlas • Atlas architecture • Features and Roadmap
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demo usecase – Ad network • Matches advertiser demand with ad space supply from publishers • Billing based on ad impressions/ad engagement • Enables targeting, tracking and reporting of ad impressions • Typical reports/queries: • Mismatch of demand and supply • Country/os wise reports • Top advertisers/publishers
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data landscape Traditional warehouse Ad servers User Ad Impression, Click, Billing logs Metadata Summaries
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data governance requirements • Cross platform lineage – impact analysis, forensic, discovery • Asset search • Common Business Terms • Compliance
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demo • Technical and business metadata • Cross Component Lineage • Creating views • Create tags • Entity deletes • Search using tags, attributes • Entity audit • Business catalog – find assets • Flexible model, external lineage ingest HDP 2.5
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Governance Data Discovery and Tagging Metadata Management Data Lineage/Prov enance Access Management Data Security & PrivacyData Quality Compliance and Audit Data Wrangling Data Lifecycle Management Data integration Data Governance Aspects Data governance refers to processes, methods and tools used in an enterprise for effective control of availability, usability, integrity, and security of data
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Enterprise Data Governance: Apache Atlas Data Management along the entire data lifecycle with integrated provenance and lineage capability • Cross component lineage Modeling with Metadata enables comprehensive business metadata vocabulary with enhanced tagging and attribute capabilities • Common Business Language • Hierarchically organized – No dupes ! Interoperable Solutions across the Hadoop ecosystem, through a common metadata store • Combine and Exchange Metadata STRUCTURED TRADITIONAL RDBMS METADATA MPP APPLIANCES Kafka Storm Sqoop Hive ATLAS METADATA Falcon RANGER Custom Partners
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Background: DGI Community becomes Apache Atlas May 2015 Apache Atlas Incubation DGI group Kickoff Dec 2014 Aug 2016 HDP 2.5/ Apache 0.7 Release Global Financial Company * DGI: Data Governance Initiative Key Benefits: • Co-Dev = Built for real customer use cases • Faster & Safer = Customers know business + HWX knows Hadoop • Code contributors - Hortonworks, IBM, Aetna , Merck, Target Jul 2015 HDP 2.3/ Apache 0.5 Foundation Release
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Architecture
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Architecture
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Atlas Type System • Defines model – schema of metadata • Flexible and powerful to define any model/custom types • Supports inheritance • Types • Primitive types – bool, integer types, string, date, enum • Collections - array, map • Struct – set of attributes • Class – Identifiable struct, hierarchy • Trait – set of attributes, hierarchy
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive Model • DataSet – metaType: ClassType; name: String, required • hive_db – metaType: ClassType; name: string, required; createTime: date, required; parameters: map<string,string>, optional • hive_table – metaType: ClassType; db: hive_db, required; createTime: date, required; columns: array<hive_column>, required • hive_column – metaType: ClassType; name: string, required; type: string, required • Diagram edge labels: extends, references, references 0..n
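The type model on this slide can be sketched in plain Python. This is not the Atlas API: `Attribute`, `ClassType`, `all_attributes` and `missing_required` are hypothetical names chosen for illustration, and only the attribute names and required/optional flags come from the slide. The sketch also assumes that the `extends` arrow runs from `hive_table` to `DataSet`, which the flattened diagram does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Attribute:
    type_name: str          # e.g. "string", "date", "array<hive_column>"
    required: bool = True


@dataclass
class ClassType:
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)
    super_type: Optional["ClassType"] = None

    def all_attributes(self) -> Dict[str, Attribute]:
        """Own attributes plus everything inherited from the super type chain."""
        inherited = self.super_type.all_attributes() if self.super_type else {}
        return {**inherited, **self.attributes}

    def missing_required(self, values: dict) -> list:
        """Names of required attributes that `values` does not supply."""
        return sorted(name for name, attr in self.all_attributes().items()
                      if attr.required and name not in values)


data_set = ClassType("DataSet", {"name": Attribute("string")})
hive_column = ClassType("hive_column", {
    "name": Attribute("string"),
    "type": Attribute("string"),
})
hive_db = ClassType("hive_db", {
    "name": Attribute("string"),
    "createTime": Attribute("date"),
    "parameters": Attribute("map<string,string>", required=False),
})
# Assumption: hive_table is the type that extends DataSet.
hive_table = ClassType("hive_table", {
    "db": Attribute("hive_db"),
    "createTime": Attribute("date"),
    "columns": Attribute("array<hive_column>"),
}, super_type=data_set)
```

Under that assumption, `hive_table` picks up `name` from `DataSet`, which would explain why the slide lists no `name` attribute on `hive_table` itself.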
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Entities: instances of types • hive_db – name: rawlogs, guid: 1, createTime: 2015-01-01 10:00 • hive_table – name: impressions, guid: 2 • hive_column – name: adv_id, type: string, guid: 3 • hive_column – name: user_id, type: string, guid: 4 • Diagram edge labels: db, column, column, trait, trait • Traits: EXPIRES_ON (Time: March, 2016), PII
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Graph Engine • Graph Database • Titan with storage backed by HBase • Types and Entities are translated to the Graph Model • Classes, Structs and Traits map to a vertex • Relationships are mapped as edges • Rich relationships between metadata objects • Indexing and Search • Indexing based on type annotations • External indexing – Titan backed by Solr
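To make the vertex and edge mapping concrete, here is a minimal in-memory sketch. It does not use Titan, HBase or Solr; the `Graph` class and its methods are hypothetical, the sample entities are the ones from the Entities slide, and which entity carries which trait is an assumption because the flattened diagram does not say.

```python
from collections import defaultdict


class Graph:
    """Tiny property graph: vertices hold properties, edges carry a label."""

    def __init__(self):
        self.vertices = {}                  # guid -> property dict
        self.out_edges = defaultdict(list)  # guid -> [(label, target guid)]

    def add_vertex(self, guid, **props):
        self.vertices[guid] = props

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))

    def out(self, guid, label):
        """Guids reachable from `guid` over edges with the given label."""
        return [target for (edge_label, target) in self.out_edges[guid]
                if edge_label == label]


g = Graph()
# Class instances map to vertices.
g.add_vertex(1, type="hive_db", name="rawlogs")
g.add_vertex(2, type="hive_table", name="impressions")
g.add_vertex(3, type="hive_column", name="adv_id")
g.add_vertex(4, type="hive_column", name="user_id")
# Trait instances map to vertices as well.
g.add_vertex("t1", type="PII")
g.add_vertex("t2", type="EXPIRES_ON", time="March, 2016")

# Relationships map to edges.
g.add_edge(2, "db", 1)
g.add_edge(2, "column", 3)
g.add_edge(2, "column", 4)
g.add_edge(4, "trait", "t1")   # assumption: user_id is the PII column
g.add_edge(2, "trait", "t2")   # assumption: the table carries EXPIRES_ON

column_names = [g.vertices[c]["name"] for c in g.out(2, "column")]
```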
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Titan property graph model Graph Search with Gremlin saturn = g.V.has('name','saturn').next() hercules = saturn.as('x').in('father').loop('x'){it.loops < 3}.next() hercules.outE('battled').has('time', T.gt, 1).inV.name Result: cerberus, hydra
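The traversal on this slide can be followed step by step in plain Python. This is only a sketch of what the Gremlin query walks over, not Gremlin or Titan code; the edge list is a small subset of Titan's sample dataset written from memory, so treat the names and `time` values as illustrative, and `in_vertices` and `out_edges` are hypothetical helpers.

```python
# (source, label, target, edge properties)
edges = [
    ("jupiter", "father", "saturn", {}),
    ("hercules", "father", "jupiter", {}),
    ("hercules", "battled", "nemean", {"time": 1}),
    ("hercules", "battled", "hydra", {"time": 2}),
    ("hercules", "battled", "cerberus", {"time": 12}),
]


def in_vertices(vertex, label):
    """Sources of edges with `label` pointing at `vertex` (like Gremlin `in`)."""
    return [s for (s, l, t, _) in edges if t == vertex and l == label]


def out_edges(vertex, label):
    """Edges with `label` leaving `vertex` (like Gremlin `outE`)."""
    return [(s, l, t, p) for (s, l, t, p) in edges if s == vertex and l == label]


# Follow the incoming "father" edge twice: saturn -> jupiter -> hercules.
grandchildren = [grandchild
                 for child in in_vertices("saturn", "father")
                 for grandchild in in_vertices(child, "father")]
hercules = grandchildren[0]

# Opponents battled with an edge property time > 1, sorted for a stable order.
battled = sorted(target for (_, _, target, props) in out_edges(hercules, "battled")
                 if props["time"] > 1)
```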
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Search • Find relevant assets based on their attributes and their associations with business terms • DSL with SQL-like syntax based on the type system: from $type is $trait where $clause select|has $attributes, repeat • Examples • Select columns from a hive_table where its name is "impressions" and db name is "raw": hive_column where table.name="impressions", table.db.name="raw" • Select all columns from hive tables which are tagged as "PII": hive_column is 'PII' • Full text search: '(rawlogs) AND hive', '(rawlogs OR supply*) AND hive'
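What the two DSL examples select can be illustrated with a small filter over in-memory entities. This is not the Atlas query engine: `resolve` and `search` are hypothetical helpers, and the `clicks` table and `email` column are invented so that the filters have something to exclude.

```python
db = {"type": "hive_db", "name": "raw"}
impressions = {"type": "hive_table", "name": "impressions", "db": db}
clicks = {"type": "hive_table", "name": "clicks", "db": db}   # hypothetical table

entities = [
    {"type": "hive_column", "name": "adv_id", "table": impressions, "traits": set()},
    {"type": "hive_column", "name": "user_id", "table": impressions, "traits": {"PII"}},
    {"type": "hive_column", "name": "email", "table": clicks, "traits": {"PII"}},
]


def resolve(entity, path):
    """Follow a dotted attribute path such as 'table.db.name'."""
    value = entity
    for part in path.split("."):
        value = value[part]
    return value


def search(type_name, trait=None, where=None):
    """Names of entities matching a type, an optional trait, and attribute filters."""
    where = where or {}
    return [e["name"] for e in entities
            if e["type"] == type_name
            and (trait is None or trait in e["traits"])
            and all(resolve(e, path) == value for path, value in where.items())]


# hive_column where table.name="impressions", table.db.name="raw"
in_impressions = search("hive_column",
                        where={"table.name": "impressions", "table.db.name": "raw"})
# hive_column is 'PII'
pii_columns = search("hive_column", trait="PII")
```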
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Features and Roadmap
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Atlas Component Integration & Lineage • Cross-component dataset lineage. Centralized location for all metadata inside HDP • Single Interface point for Metadata Exchange with platforms outside of HDP Apache Atlas Hive Ranger Falcon Sqoop Storm Kafka Spark NiFi HBase Partner Custom HDP 2.3 HDP 2.5 Beyond HDP 2.5 HDP 2.5 External
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Business Catalog for Ease of Use • Organize data assets along business terms – Authoritative: Hierarchical Taxonomy Creation – Agile modeling: Model Conceptual, Logical, Physical assets – Definition and assignment of tags like PII (Personally Identifiable Information) • Comprehensive features for compliance – Multiple user profiles including Data Steward and Business Analysts – Object auditing to track "Who did it" – Metadata Versioning to track "what did they do" • Faster Insight (Roadmap) – Data Quality tab for profiling and sampling – User Comments Key Benefits: Organize data assets along business terms Compliance Features: Faster Insight
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger: Introduction Centralized authorization and auditing across Hadoop components • HDFS, Hive, HBase, Knox, Storm, YARN, Kafka, Solr, .. • Audit logs to: Solr, HDFS, RDBMS, Log4j, .. Resource based security • Policies for a specific set of resources • Requires revision of policies as resources get added/moved Classification based security • Policies for classifications and not for specific resources • A single policy protects resources in multiple components • As classifications for resources change, appropriate policies are automatically applied • Enables separation of duties: resource-classification and security policies
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Scalable Access Control – Reusable Tag Policy User group • AD • Linux Resources: • Files • Tables • Topologies Atlas Tag • PII ANY asset PII • Files • Tables • Topologies Single Admin Group Assigns Many Stewards Tag + Single point of enforcement and audit All future tagging is covered by existing policy Not Scalable Scalable
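The difference between per-resource and per-tag policies can be sketched as follows. This is not Ranger's policy engine or its policy format: the dictionaries, the asset names and `is_allowed` are hypothetical, and the sketch assumes that untagged assets are allowed, whereas a real deployment would also consult resource-based policies.

```python
# One policy per tag, independent of which component owns the asset.
tag_policies = {"PII": {"allowed_groups": {"us_hr"}}}

# Asset -> tags, as stewards would assign them in Atlas.
asset_tags = {
    "hdfs:/data/customers.csv": {"PII"},
    "hive:raw.impressions.user_id": {"PII"},
    "hive:raw.impressions.adv_id": set(),
}


def is_allowed(group, asset):
    """Deny if any tag on the asset has a policy that excludes the group."""
    for tag in asset_tags.get(asset, set()):
        policy = tag_policies.get(tag)
        if policy and group not in policy["allowed_groups"]:
            return False
    return True


# Tagging a new asset later needs no policy change: the PII policy already covers it.
asset_tags["kafka:customer_events"] = {"PII"}
```

A single `PII` entry protects the file, the table column and the topic alike, which is the scalability argument the slide makes.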
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Open: Governance Ready Certification Program Choice: Customers choose features that they want to deploy, a la carte versus vendor lock-in Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy Agile: Low switching costs, Faster deployment and innovation Centralized: Common SLA & common open metadata store Flexibility: Interoperability of products through Atlas metadata Safe: HDP at core to provide stability and interoperability Completed: • Waterline • Dataguise • Attivio • Trifacta Pending: • Collibra • Alation • Meta Integration (Miti) • Paxata • Syncsort • Talend
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Roadmap… • MultiTenancy • Titan 1.x Migration • Hive Column Level Lineage
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Summary • Designed for Hadoop at platform, not application level • High Confidence data in Hadoop for regulated verticals • Compliance and business objectives aligned to data organization • Faster discovery for analysts – reduce time to value • Agile and adaptable – ensures information is current by native connectors • Dynamic protection with Ranger in simple audited policies
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Learn More: • Apache Incubator link: http://atlas.incubator.apache.org/ • Hortonworks links: http://hortonworks.com/solutions/security-and-governance/ • https://community.hortonworks.com/spaces/64/governance-lifecycle-track.html?topics=Atlas&type=question • Atlas Technical User Guide - http://atlas.incubator.apache.org/AtlasTechnicalUserGuide.pdf
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Questions
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Backup
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Access Policy Apache Ranger + Atlas Integration
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How does Atlas work with Ranger at scale? Atlas provides: Metadata • Business Classification (taxonomy): Company > HR > Driver • Hierarchy with inheritance of attributes to child objects: the sensitive "PII" tag of department HR will be inherited by group HR > Driver • Atlas will notify Ranger via a Kafka topic about changes Apache Atlas Hive Ranger Falcon Kafka Storm Atlas provides the metadata tag to create policies Ranger provides: Access & Entitlements • Ranger will cache tags and asset mapping for performance • Ranger will have a policy based on tags instead of roles. • Example: PII = <group> This can work for many assets.
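The tag inheritance described on this slide amounts to walking up the taxonomy. A minimal sketch, assuming a simple child-to-parent map; the `Finance` term, the dictionaries and `effective_tags` are hypothetical and not part of Atlas.

```python
# child -> parent; a hypothetical encoding of "Company > HR > Driver"
parent = {"HR": "Company", "Driver": "HR", "Finance": "Company"}

# Tags assigned directly to a taxonomy term.
direct_tags = {"HR": {"PII"}}


def effective_tags(term):
    """Tags set on the term itself plus those inherited from every ancestor."""
    tags = set()
    while term is not None:
        tags |= direct_tags.get(term, set())
        term = parent.get(term)
    return tags
```

Here `Driver` ends up with `PII` because its parent `HR` carries it, while `Finance` and `Company` do not.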
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Automatic update of policies – active protection Metastore • Tags • Assets • Entities Notification Framework Kafka Topics Atlas Atlas Client • Subscribes to Topic • Gets Metadata Updates PDP Resource Cache Ranger Notification Metadata updates Message durability Optimized for Speed Event driven updates
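The event-driven update path on this slide can be sketched with a standard-library queue standing in for the Kafka topic. No Kafka, Atlas or Ranger client is used; `publish`, `drain`, the message shape and the asset names are all hypothetical.

```python
import queue

topic = queue.Queue()   # stand-in for the Kafka topic
tag_cache = {}          # asset -> set of tags, as cached on the Ranger side


def publish(asset, tags):
    """Atlas side: emit a notification when an asset's tags change."""
    topic.put({"asset": asset, "tags": sorted(tags)})


def drain():
    """Ranger side: apply all pending notifications to the local cache."""
    applied = 0
    while not topic.empty():
        event = topic.get()
        tag_cache[event["asset"]] = set(event["tags"])
        applied += 1
    return applied


publish("hive:raw.impressions.user_id", {"PII"})
publish("hive:raw.tax_2010", {"EXPIRES_ON"})
publish("hive:raw.impressions.user_id", set())   # the tag is removed later
applied = drain()
```

Because policy decisions read from the cache, a lookup never has to call back into the metadata store, and the cache converges to the latest state as messages are consumed in order.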
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger: Authorization and Auditing HBase Ranger Administration Portal HDFS Hive Server2 Ranger Audit StoreRanger Policy Store Ranger Plugin Hadoop Components Enterprise Users Log4j Knox Storm YARN Kafka Solr HDFS Solr Ranger Plugin Ranger Plugin Ranger Plugin Ranger Plugin Ranger Plugin Ranger Plugin Ranger Plugin RDBMS
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Big Data Governance Current Landscape • Opaque data in a variety of data stores – HDFS, S3, Data warehouses • Schema is hardly sufficient – Hive Metastore, Avro, Data Warehouse • Platform tools like Ranger and Falcon solve parts of the problem Need for Data governance Organizations need data governance to understand their information and answer questions such as: • What do we know about our information? • Where did this data come from and how’s it being used? • Does this data adhere to company policies and rules? • Need for effective control and consumption of data Atlas helps customers discover information about data objects, their meaning, location, characteristics, and usage.
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Business Taxonomy Business Taxonomy (Catalog) The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication. Tags: Traits vs. Labels vs. Business Taxonomy Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A tag PII can be used in HR as well as Finance or Sales. Benefits: A view of data assets organized by business language Compliance, Acceptable use – Dynamic Metadata based access control Common taxonomy through Hadoop components
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Principal Roles & Activities in an Enterprise • Data Steward – Curator, responsible for data classification – associate business taxonomy and tagging, access policies • Data Scientist – Analyst, primary consumer of Business Taxonomy • Administrator/Operations – Role management, Data lifecycle management (Archival, retention) • Data Engineer – Data ingress and egress, semantic data quality • 50% - 80%+ of time spent looking for data • Profit Center • Primary User of Atlas • Enables Scientist Goal: < 25% spent on finding data = Empowering scientists to spend their time uncovering insights faster
  • 37. 37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Governance Usecases: Impact analysis • HortonAdNetwork – A large ad network which has an international footprint with multiple publishers and advertisers across several countries • Complex ETL jobs and data pipelines processing real-time ad network data from several different sources and various data processing platforms • No easy way to determine the root cause when something is off the charts • Data analysts need effective data provenance tools for impact/root cause analysis • Cross component lineage is a must • Data Lineage (Provenance) Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources
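Impact analysis and root cause analysis are the two directions of a walk over the lineage graph. A minimal sketch, assuming lineage is available as a mapping from each dataset to the datasets derived from it; the pipeline, the dataset names and both functions are hypothetical and do not reflect how Atlas stores or queries lineage.

```python
from collections import deque

# dataset -> datasets derived from it (hypothetical ad network pipeline)
derived = {
    "kafka:ad_events": ["hive:raw.impressions"],
    "hive:raw.impressions": ["hive:agg.impressions_by_country",
                             "hive:agg.top_advertisers"],
    "hive:agg.impressions_by_country": ["report:country_os"],
}


def downstream(start):
    """Impact analysis: everything derived, directly or indirectly, from `start`."""
    seen, pending = [], deque([start])
    while pending:
        for child in derived.get(pending.popleft(), []):
            if child not in seen:
                seen.append(child)
                pending.append(child)
    return seen


def upstream(target):
    """Root cause analysis: every dataset that `target` depends on."""
    result = []
    for source, outputs in derived.items():
        if target in outputs:
            result.append(source)
            result.extend(x for x in upstream(source) if x not in result)
    return result
```

If a country report looks wrong, `upstream("report:country_os")` lists the candidates to inspect; if a raw table is found to be corrupt, `downstream("hive:raw.impressions")` lists what has to be recomputed.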
  • 38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Governance Usecases - Compliance • HortoniaBank – mid size bank expanding from US to international markets • 2 Customer Tables owned by BH: 50K customer records each with 38 fields (PII, PHI, PCI & non-sensitive data) – us_customers: USA person data only – ww_customers: multi-language, multi-country, localized person data • 1 data set of prospects leased from a data broker – tax_2010: Data lease expired already!
  • 39. 39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved User / Group / Access Privileges • joe_analyst / us_employee / US data only, non-sensitive data only, rest forbidden depending on sensitivity • kate_hr / us_hr / US data only, all sensitive data (PCI, PII, PHI) Tag Based Policies • US HR team members can see all original data (PCI, PII, …) • Analysts are prohibited from viewing PII data in any of the tables • Everyone except operations/admin is prohibited from accessing tax_2010 after the specified date: the EXPIRES_ON policy turns off access on the configured expiry date
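The three tag-based policies on this slide can be combined into one small access check. This is a sketch and not Ranger policy syntax: the column names, the expiry date, the group sets and `can_read` are hypothetical, and the sketch assumes the expiry rule takes precedence over every other rule.

```python
from datetime import date

# column -> tags (hypothetical columns for the tables named in the use case)
column_tags = {
    "us_customers.name": {"PII"},
    "us_customers.zip": set(),
    "tax_2010.income": set(),
}
# Table-level EXPIRES_ON tag with its configured date (hypothetical value).
table_expiry = {"tax_2010": date(2016, 3, 31)}

SENSITIVE = {"PII", "PCI", "PHI"}
can_see_sensitive = {"us_hr"}
exempt_from_expiry = {"operations"}


def can_read(group, column, today):
    """Apply the expiry rule first, then the sensitivity rule."""
    table = column.split(".")[0]
    expires = table_expiry.get(table)
    if expires and today > expires and group not in exempt_from_expiry:
        return False
    if column_tags.get(column, set()) & SENSITIVE and group not in can_see_sensitive:
        return False
    return True


today = date(2016, 6, 1)
```

With these inputs, `can_read("us_employee", "tax_2010.income", today)` is denied because the lease has expired, while the same call for the `operations` group is still allowed.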
  • 40. 40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sqoop Teradata Connector Apache Kafka Expanded Native Connector: Dataset Lineage Custom Activity Reporter Metadata Repository RDBMS Any process using Sqoop is covered No other tool tracks this out of the box

Editor's Notes

  • #7: Inventory, publisher, site, supply Advertiser, demand,
  • #8: Was the product well understood? Is the product something they would use? Where is the value?
  • #11: How fast ? 7 months !
  • #24: Show – clearly identify customer metadata. Change: add customer classification example – Aetna – so the use case story has continuity. Use DX procedures to diagnose ** bring metadata from external systems into Hadoop – keep it together
  • #25: Collibra - business process workflow + mapping to regulation in various countries and standards; Alation - socializing of analytics, SQL / traditional EDW based; Meta Integration (Miti); Paxata - wrangling; Syncsort - ETL, specializing in traditional systems and mainframe; Talend - ETL, metadata management; Attivio - ingestion / discovery
  • #39: Make sure audits are demoed for policy denials and acceptances