Cloudera Navigator

Outline
● Capabilities
● Architecture
● Quick Demo
● Q&A
Capabilities
● Discovery
  ○ Search through metadata to find data set/operation of interest.
  ○ View schema, associated metadata etc. for a dataset
● Lineage
  ○ Given a data set, trace back to the original source.
  ○ Understand the impact of modifying a data set.
● Audit
  ○ Generate report of access to a data set in Hadoop.
  ○ Generate alert when a restricted data set is accessed.
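
The two lineage questions above (where did a data set come from, and what is affected if it changes) amount to walking a data-flow graph in opposite directions. The following is a minimal sketch, not Navigator's implementation. It assumes lineage is available as (source, target) pairs; the edge list and the `reports_by_state` data set are hypothetical, and the other table names are borrowed from the Hive query shown later in the deck.

```python
from collections import defaultdict, deque

# Hypothetical data-flow edges (source -> target).
EDGES = [
    ("crm_accounts", "machine_vendors"),
    ("cluster_metadata", "machine_vendors"),
    ("machine_stats", "machine_vendors"),
    ("machine_vendors", "reports_by_state"),  # hypothetical downstream data set
]


def build_graph(edges):
    """Index the edges in both directions."""
    upstream, downstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def _reachable(start, neighbours):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def original_sources(dataset, edges):
    """Trace back: upstream data sets that have no inputs of their own."""
    upstream, _ = build_graph(edges)
    return {n for n in _reachable(dataset, upstream) if not upstream[n]}


def impacted_by(dataset, edges):
    """Impact analysis: every data set derived, directly or not, from dataset."""
    _, downstream = build_graph(edges)
    return _reachable(dataset, downstream)
```

Calling `original_sources("reports_by_state", EDGES)` walks upstream to the three input tables, while `impacted_by("crm_accounts", EDGES)` walks downstream to everything derived from that table.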
Discovery & Lineage (Questions to be asked?)
● Ad-hoc or only predefined?
● Granularity?
● Analysis?
Discovery & Lineage (Supported Systems)
● HDFS
● Hive
● MR1
● Oozie
● Pig
● YARN
● ...More coming...
Discovery (Metadata Search)
Discovery (View Schema)
Discovery (Augment Metadata)
Discovery (Search on associated metadata)
Sidecars (Colocation of associated metadata)
/user/root/customers/cust_demo
/user/root/customers/.cust_demo.navigator
Contents of .cust_demo.navigator
{
  "properties" : {
    "secret" : "true",
    "retention" : "small"
  },
  "tags" : ["pci"]
}
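
The naming pattern in this example (a dot-prefixed sibling file with a `.navigator` suffix) can be sketched as follows. This only reproduces the single example on the slide; the helper names `sidecar_path` and `sidecar_body` are hypothetical, and whether Navigator accepts other layouts or keys is not stated in the deck.

```python
import json
import posixpath


def sidecar_path(data_path):
    """Return the sidecar location for an HDFS-style path.

    Mirrors the slide's example: same directory, name prefixed with "."
    and suffixed with ".navigator".
    """
    parent, name = posixpath.split(data_path.rstrip("/"))
    return posixpath.join(parent, "." + name + ".navigator")


def sidecar_body(properties, tags):
    """Serialize properties and tags in the layout shown on the slide."""
    return json.dumps({"properties": properties, "tags": tags}, indent=2)


path = sidecar_path("/user/root/customers/cust_demo")
body = sidecar_body({"secret": "true", "retention": "small"}, ["pci"])
```

Writing `body` to `path` in HDFS is left out so the sketch runs without a cluster.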
Lineage (Hive Query)
INSERT OVERWRITE TABLE machine_vendors
SELECT
  upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)", 1))) AS manufacturer,
  upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)", 2))) AS product,
  ca.address_state, ca.customerKey, cm.clusterId, ms.machineName
FROM crm_accounts ca
JOIN cluster_metadata cm
  ON ca.customerKey = cm.customerKey
JOIN machine_stats ms
  ON cm.customerKey = ms.customerKey AND cm.clusterId = ms.clusterId AND cm.collectionTS = ms.collectionTS
Lineage
Lineage (Path highlighted)
Lineage (Instance)
Lineage (Template)
Lineage (Pig Script)
posts = LOAD 'stackoverflow/posts/posts.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage()
AS (id:int, postTypeId:int, acceptedAnswerId:int, parentId:int, creationDate:chararray,
score:int, viewCount:int, body:chararray, ownerUserId:chararray, lastEditorUserId:int,
lastEditorDisplayName:chararray, lastEditDate:chararray, lastActivityDate:chararray, title:chararray,
tags:chararray, answerCount:int, commentCount:int, favoriteCount:int, closedDate: chararray,
communityOwnedDate:chararray);

comments = LOAD 'stackoverflow/comments/comments.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage()
AS (id:int, postId:int, score:int, text:chararray, creationDate:chararray, userDisplayName:chararray,
userId: int);

joined_post_comments = JOIN posts by id, comments by postId;

post_comments = FOREACH joined_post_comments GENERATE posts::id..posts::communityOwnedDate,
comments::postId..comments::userId;
grouped_comments = GROUP post_comments BY posts::id;
comments_per_post = FOREACH grouped_comments GENERATE group as postId, post_comments.comments::text as comment;
rmf stackoverflow/output/comments_per_post
STORE comments_per_post INTO 'stackoverflow/output/comments_per_post' USING PigStorage();
Lineage (Pig)
Discovery & Lineage Architecture
Model
● Generic (Element, Relations)
● Element
  ○ Unique Identity
  ○ Key-value pairs
  ○ Tags
  (Operation, Operation Execution, FSElement, Table, Column…)
Model (Contd…)
● Relation
  ○ Unique Identity
  ○ Two sets of related elements
  ○ Relationship type
  (Parent Child Relation, Data Flow Relation, Control Flow Relation, Instance Of Relation, Alias Relation, Generic Relation)
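
As a rough illustration of this generic model, the sketch below expresses elements and relations as Python dataclasses. It is an assumption-laden sketch rather than Navigator's actual classes: the field names `first` and `second`, the short identities, and every enum constant other than `PARENT_CHILD` (which appears in the REST example) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet, List


class RelationType(Enum):
    # PARENT_CHILD is the spelling used in the REST example;
    # the other constant names are assumed by analogy.
    PARENT_CHILD = "PARENT_CHILD"
    DATA_FLOW = "DATA_FLOW"
    CONTROL_FLOW = "CONTROL_FLOW"
    INSTANCE_OF = "INSTANCE_OF"
    ALIAS = "ALIAS"
    GENERIC = "GENERIC"


@dataclass
class Element:
    """Anything tracked: an operation, a file, a table, a column, ..."""
    identity: str
    properties: Dict[str, str] = field(default_factory=dict)  # key-value pairs
    tags: List[str] = field(default_factory=list)


@dataclass
class Relation:
    """Links two sets of elements, referenced by identity."""
    identity: str
    type: RelationType
    first: FrozenSet[str]   # e.g. the parent, or the data-flow sources
    second: FrozenSet[str]  # e.g. the children, or the data-flow targets


# Hypothetical identities; real ones look like 32-character hex strings.
table = Element("t1", {"retention": "small"}, ["pci"])
column = Element("c1")
rel = Relation("r1", RelationType.PARENT_CHILD,
               frozenset({table.identity}), frozenset({column.identity}))
```

Keeping both ends of a relation as sets lets one record describe, for example, a join whose several inputs feed one output.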
Discovery & Lineage (REST API)
● Elements Resource
  ○ curl 'http://localhost:5150/api/v1/elements?query=originalName:job_&limit=100&offset=100'

[{
"identity" : "513bf7add8d5f56b7f0f34769707cb5f",
"originalName" : "job_1389320017591_0024_conf.xml",
"firstClassParentId" : null,
"name" : null,
"description" : null,
"tags" : null,
"properties" : null,
"fileSystemPath" : "/user/history/done/2014/01/31/000000/job_1389320017591_0024_conf.xml",
"category" : "FILE",
"size" : 139211,
"lastModified" : "1969-12-31T23:59:59.999Z",
"lastAccessed" : "2014-02-04T02:12:01.369Z",
"owner" : "root",
"group" : "hadoop",
"blockSize" : null,
"mimeType" : "application/octet-stream",
"replication" : null,
"deleted" : false,
"resType" : "HDFS",
"permission" : 432,
"resId" : "858e5548b4cd3457432eb491ee74729d",
"type" : "fselement"
}, ...]

  ○ curl 'http://localhost:5150/api/v1/elements/f53ae3547a90b7519b44041db1898972'
  ○ curl -X PUT -H "Content-Type: application/json" -d '{"displayName":"test","description":"describe me","tags":[]}' http://localhost:5150/api/v1/elements/e5f94cd59a8ca6df96247ce88b6c9c28
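
The same search and update calls can be prepared from Python with only the standard library. This is a minimal sketch: the host, port, paths, and JSON keys are taken from the curl commands on this slide (with `description` spelled as in the search response), while the helper names `search_url` and `update_request` are hypothetical. Nothing is sent over the network here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5150/api/v1"  # host and port as on the slide


def search_url(query, limit=100, offset=0):
    """Build the Elements search URL; ':' is kept literal as in the curl example."""
    params = urlencode({"query": query, "limit": limit, "offset": offset}, safe=":")
    return f"{BASE_URL}/elements?{params}"


def update_request(element_id, **fields):
    """Prepare (but do not send) a PUT that updates an element's metadata."""
    return Request(
        f"{BASE_URL}/elements/{element_id}",
        data=json.dumps(fields).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


url = search_url("originalName:job_", limit=100, offset=100)
req = update_request(
    "e5f94cd59a8ca6df96247ce88b6c9c28",
    displayName="test", description="describe me", tags=[],
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```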
Discovery & Lineage (REST API)
● Relations Resource
curl 'http://localhost:7187/api/v1/relations?elementIds=83f4cdcc37c379144fef22e3dbdf7c8c&types=PARENT_CHILD&depth=2'
[{
  "identity" : "91540192d3dd727f912b3c0bb91cdd81",
  "type" : "PARENT_CHILD",
  "parent" : [ {
    "elementId" : "83f4cdcc37c379144fef22e3dbdf7c8c"
  } ],
  "children" : [ {
    "elementIds" : [ "6144fabee63641275c5577697f16266a" ]
  } ],
  "name" : null
}, ...]

● Interactive Resource
curl 'http://localhost:7187/api/v1/interactive/elements?query=originalName:test&limit=2'
{
  "offset" : 0,
  "totalMatched" : 2,
  "limit" : 1,
  "results" : [ {
    "identity" : "9b7b9d95eb06ccf0b1b0cd1a39642889",
    "category" : "DIRECTORY", ... } ],
  "facets" : { },
  "qtime" : 10
}
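
The `offset`, `limit`, and `totalMatched` fields in the interactive response suggest offset-based paging. The sketch below shows one way a client could walk all pages. It assumes `totalMatched` counts matches across all pages and that `offset` is zero-based, neither of which the slide states; `fetch_page` is a hypothetical caller-supplied function, replaced here by an in-memory stand-in so the example runs offline.

```python
def iter_results(fetch_page, query, limit=100):
    """Yield every result for query, requesting one page at a time.

    fetch_page(query, limit, offset) must return a dict shaped like the
    interactive response: offset, totalMatched, limit, results.
    """
    offset = 0
    while True:
        page = fetch_page(query, limit, offset)
        results = page["results"]
        yield from results
        offset += len(results)
        # Stop on an empty page or once everything reported has been seen.
        if not results or offset >= page["totalMatched"]:
            break


# In-memory stand-in for the HTTP call (hypothetical data).
_FAKE = [{"identity": f"id{i}"} for i in range(5)]


def fake_fetch(query, limit, offset):
    return {
        "offset": offset,
        "totalMatched": len(_FAKE),
        "limit": limit,
        "results": _FAKE[offset:offset + limit],
        "facets": {},
        "qtime": 0,
    }


ids = [r["identity"] for r in iter_results(fake_fetch, "originalName:test", limit=2)]
```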
Audit (Supported Systems)
● HDFS
● HBase
● Hive
● Impala
● ...More coming...
Audit Configuration
Audit View
Audit Details
● User
  ○ Username, Impersonator, IP Address
● Operation Information
  ○ Operation Type, Session Id, Query Id, Operation Text, Status, Time
● Object Information
  ○ ServiceName, Path (Different in different systems)
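
To make the shape of an audit record concrete, the sketch below maps the fields listed above onto a Python dataclass and adds a simple check for the "alert when a restricted data set is accessed" capability. It is an illustration only: the attribute names, the sample values, and the prefix-matching rule are assumptions, and since the path format differs between systems, the check only makes sense for HDFS-style paths.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AuditEvent:
    # User
    username: str
    impersonator: Optional[str]
    ip_address: str
    # Operation information
    operation_type: str
    session_id: Optional[str]
    query_id: Optional[str]
    operation_text: str
    status: str
    time: str
    # Object information
    service_name: str
    path: str


def is_restricted_access(event: AuditEvent, restricted_prefixes: Iterable[str]) -> bool:
    """True if the event touches a restricted path or anything beneath it."""
    for prefix in restricted_prefixes:
        base = prefix.rstrip("/")
        if event.path == base or event.path.startswith(base + "/"):
            return True
    return False


# Hypothetical event; only the path comes from the sidecar example in this deck.
event = AuditEvent(
    username="alice", impersonator=None, ip_address="10.0.0.5",
    operation_type="open", session_id=None, query_id=None,
    operation_text="", status="allowed", time="2014-02-04T02:12:01Z",
    service_name="HDFS", path="/user/root/customers/cust_demo",
)
alert = is_restricted_access(event, ["/user/root/customers"])
```

Matching on a path boundary (rather than a bare string prefix) avoids flagging a sibling such as `/user/root/customers_archive`.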
Audit Architecture

Log4j Appender
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters