© Cloudera, Inc. All rights reserved.
Enterprise Metadata Integration
Mirko Kämpf | Cloudera
GraphConnect 2017 – London
Who is speaking?
Solutions Architect @ Cloudera
- time series analysis, network analysis, data enrichment pipelines
- personal interest: QA systems and semantic search
Data Science Activities
The Detection of Emerging Trends Using Wikipedia Traffic Data and Context Networks (PLOS ONE, 2015)
Hadoop.TS (IJCA, 2013)
Fluctuations in Wikipedia Access-Rate and Edit-Event Data (Physica A, 2012)
Our Approach: Multilayer Metadata Integration …
• Status dashboards are provided per topic / use case.
• Each dashboard offers facts from multiple layers:
- (L1) Cluster-specific metadata
- (L2) Hadoop-specific ops metadata (only)
- (L3) Application-specific ops metadata
- (L4) Quality metrics and derived facts
• Current project status:
• The graph database Neo4j and Cypher allow context exploration.
• Cluster-spanning metadata exploration is possible.
• Exposing inherent but sometimes hidden facts becomes as easy as writing an email.
Integration of facts to gain business knowledge
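A minimal sketch of how one dashboard could collect facts from the four layers. The layer labels come from the slide; the `Fact` class, the `build_dashboard` helper, and all sample topics and values are hypothetical and are not part of the project described in this talk.

```python
from dataclasses import dataclass

# Layer labels follow the slide; everything else is a hypothetical illustration.
LAYERS = {
    "L1": "cluster-specific metadata",
    "L2": "Hadoop-specific ops metadata",
    "L3": "application-specific ops metadata",
    "L4": "quality metrics and derived facts",
}


@dataclass(frozen=True)
class Fact:
    topic: str     # dashboard topic / use case
    layer: str     # one of the keys in LAYERS
    name: str
    value: object


def build_dashboard(facts, topic):
    """Group all facts that belong to one topic by metadata layer."""
    dashboard = {layer: {} for layer in LAYERS}
    for fact in facts:
        if fact.topic != topic:
            continue
        if fact.layer not in LAYERS:
            raise ValueError(f"unknown layer: {fact.layer}")
        dashboard[fact.layer][fact.name] = fact.value
    return dashboard


# Invented sample facts for two topics.
facts = [
    Fact("clickstream-etl", "L1", "cluster_name", "prod-a"),
    Fact("clickstream-etl", "L2", "hdfs_bytes_read", 1_250_000),
    Fact("clickstream-etl", "L3", "job_runtime_s", 312),
    Fact("clickstream-etl", "L4", "null_ratio", 0.02),
    Fact("billing-report", "L3", "job_runtime_s", 45),
]
dashboard = build_dashboard(facts, "clickstream-etl")
```

In a graph store the same grouping would be a query over fact nodes linked to a topic node; the dictionary here only illustrates the per-topic, per-layer view.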
Agenda
EMI - Enterprise Metadata Integration
• Idea & Vision
• Material
• Skills / Methods
• Tools
How To Become Data Driven?
Treat “data as a resource” for your business.
Think in terms of dataset life cycles.
People have been mining … for centuries!
http://www.montanregion-erzgebirge.de/welterbe-erleben/montanregion-fuer-bergbauspezialisten/geschichtliches.html
gold & diamonds, ore & coal, minerals, oil …
The outcome drives the whole economy
People have been using computers … for decades!
1938: Z1, the world’s first freely programmable device, created by Konrad Zuse.
http://www.horst-zuse.homepage.t-online.de/z1.html
2015: The U.S. Department of Energy uses an Intel supercomputer at Argonne National Laboratory.
http://www.intel.com/content/dam/www/public/us/en/images/photography-business/RWD/aurora-aerial-reflection-floor-rwd.png
DATA MINING
http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/
Blog: About Learning Data Mining & Data Analysis
If data is the new oil …
… metadata are the nuggets and brilliants of our age.
Screenshot taken from: https://www.quora.com/Who-should-get-credit-for-the-quote-data-is-the-new-oil
Diamonds: are beautiful even as raw material.
Brilliant: is the result of an expert’s work. You have to cut and grind it!
Even more exciting in combination with other material and skills …
Process optimization requires knowledge gathering and transfer.
Success Factors:
• Idea & Vision
• Material
• Skills / Methods
• Tools
http://www.burkhard-beyer.net/Reportage_Goldschmied.html
Tools and processes evolve …
… success criteria have remained stable.
Let’s Think Data Driven!
• Build a long-term strategy!
Not the fancy toolset but rather your data is what matters most!
• After initial success, carefully control the speed of expansion.
• Maximize accessibility of data!
Example: Google’s goal was to make the data of the internet accessible.
You should become your own Google!
Dataset Profiles / Flow Descriptors
• Our material is data & metadata:
- Data about data: descriptive data, Dublin Core metadata model, …
- Derived data: statistics extracted from processes, documents, …
- Results of ML/AI procedures: extracted structure and learned models
- Outcome of crowd-based operations: Wikipedia with its inherent structure, communication logs, access and edit history.
Knowledge Extraction for Better Data Science
Science:
According to Wikipedia:
Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
https://en.wikipedia.org/wiki/Science
Data Science:
My observation:
Data Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the market and business context.
https://en.wikipedia.org/wiki/Infographic#/media/File:Gartner_Hype_Cycle_for_Emerging_Technologies.gif
Details
Look into nature …
Context
Look into nature …
Result: Visualization of Facts
• An image shows what the text says.
> Multi-channel communication
• Data Science benefits from such an approach.
> Today we still use infographics
Difference:
The biologist who created the image on the left observed by eye.
Today, data scientists look more into data than into nature.
Process: Knowledge Extraction is a Natural Process
• Combine multiple sources
• Repeat observation
• Incorporate context to explain differences/variation
• Cross-check to identify anomalies
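The steps above can be sketched in a few lines, under simplifying assumptions: each source reports the same quantity several times, the repeated observations are reduced to a median, and a source is flagged when it deviates from the cross-source consensus by more than a relative tolerance. The function name, the source names, the numbers, and the 20 % tolerance are all hypothetical; the context step (explaining a deviation) is left to the analyst.

```python
from statistics import median


def cross_check(observations, tolerance=0.2):
    """Combine repeated observations from several sources.

    observations maps a source name to a list of repeated measurements.
    Returns (consensus, anomalies): the median of the per-source medians,
    and the sorted names of sources whose median deviates from that
    consensus by more than `tolerance` (relative deviation).
    """
    per_source = {src: median(vals) for src, vals in observations.items() if vals}
    if not per_source:
        raise ValueError("no observations")
    consensus = median(per_source.values())
    scale = abs(consensus) or 1.0  # avoid division by zero
    anomalies = sorted(
        src
        for src, value in per_source.items()
        if abs(value - consensus) / scale > tolerance
    )
    return consensus, anomalies


# Invented measurements of one metric as seen by three sources.
observations = {
    "namenode_log": [100, 102, 98],
    "job_history": [101, 99, 100],
    "monitoring_agent": [150, 155, 149],
}
consensus, anomalies = cross_check(observations)
```

A flagged source is not necessarily wrong; it is the place where additional context is needed to explain the difference.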
Process: Knowledge Extraction is a Natural Process
Knowledge
Facts	
Data
How did we implement EMDM?
- Hadoop-based: for scalability
- Open graph data model: for flexibility and connectivity
- Data-centric: following the Big Data paradigm
Big Data Processing: e.g., with Hadoop
Big Graph Processing on Hadoop: e.g., with Giraph
The project name should stand for:
Graphs, Hadoop, and the ecosystem …
Data Science Process Model (DSPM)
• DSPM defines core artifacts for knowledge management
• Describes analysis / transformation context
• Allows repeatable execution
• Process properties become measurable
• Supports comparison of results from multiple procedures
• All those facts are essential ingredients for business optimization.
• But: Logging & tracking should never block creativity!
• Remember: Scientists often act like artists.
Toolbox and Management Methods
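One way to picture the DSPM properties listed on this slide (described context, repeatable execution, measurable process properties, comparable results) is a thin wrapper that records each run. This is only a sketch under stated assumptions: `run_tracked`, the record fields, and the two sample procedures are hypothetical and are not the DSPM artifacts themselves.

```python
import hashlib
import json
import time


def run_tracked(procedure, params, data):
    """Execute a procedure and return a record that describes the run."""
    started = time.perf_counter()
    result = procedure(data, **params)
    elapsed = time.perf_counter() - started
    # A stable fingerprint of the analysis context: same procedure and
    # parameters give the same id, so repeated runs can be matched.
    context = json.dumps(
        {"procedure": procedure.__name__, "params": params}, sort_keys=True
    )
    return {
        "context_id": hashlib.sha256(context.encode("utf-8")).hexdigest()[:12],
        "procedure": procedure.__name__,
        "params": params,
        "runtime_s": elapsed,  # a measurable process property
        "result": result,
    }


def mean(data, ndigits=2):
    return round(sum(data) / len(data), ndigits)


def trimmed_mean(data, ndigits=2):
    ordered = sorted(data)[1:-1]  # drop the smallest and largest value
    return round(sum(ordered) / len(ordered), ndigits)


# Two procedures applied to the same invented data, plus one repeated run.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
runs = [
    run_tracked(mean, {"ndigits": 1}, data),
    run_tracked(trimmed_mean, {"ndigits": 1}, data),
]
rerun = run_tracked(mean, {"ndigits": 1}, data)
```

The wrapper stays out of the analyst's way: the procedure itself is unchanged, which fits the slide's point that logging and tracking should not block creativity.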
Data Science Process Model (DSPM)
Representation of domain knowledge (in our case, data science in general)
Human Interaction
Ontology
Toolbox and Management Methods
Ability to solve a problem using IT and data
Technology Aspects: represent and interact with facts & data
Data Governance
Certified QM
Semantic Logging
• Property with name: (K,V): key-value pair
• Property of a thing: S => (K,V): (S,P,O) is a triple
K becomes P; V becomes O
• Many of those triples in one common context with name G:
G => (S,P,O) is called a quad or named graph
We have to hide these technical details from users!
Obvious facts have to be connected to the knowledge graph as directly as possible.
• Log4j is the logging standard we build on.
• Using structured data instead of plain strings allows easy parsing (e.g., Apache log format).
• Triple representation avoids specific parsing and makes log data part of the linked data graph.
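The mapping from key-value pair to triple to quad can be sketched as follows. This is an illustration in Python, not the Log4j-based implementation mentioned on the slide; the helper names, the `http://example.org/` base IRI, and the sample job and run names are assumptions. Each quad is written as one N-Quads-style line, with all objects treated as plain string literals and only backslashes and double quotes escaped.

```python
def to_triple(subject, key, value):
    """A property (K,V) of a thing S becomes (S,P,O): K -> P, V -> O."""
    return (subject, key, value)


def to_quad(graph, triple):
    """Place a triple into the common context (named graph) G."""
    return (graph, *triple)


def to_nquad_line(quad, base="http://example.org/"):
    """Serialize a quad as one line: subject, predicate, object, graph.

    Simplification: names are assumed to be IRI-safe, and the object is
    always emitted as a plain string literal.
    """
    graph, subject, predicate, obj = quad
    escaped = str(obj).replace("\\", "\\\\").replace('"', '\\"')
    return f'<{base}{subject}> <{base}{predicate}> "{escaped}" <{base}{graph}> .'


def log_facts(graph, subject, properties):
    """Turn a dict of properties about one subject into quad lines."""
    return [
        to_nquad_line(to_quad(graph, to_triple(subject, key, value)))
        for key, value in properties.items()
    ]


# Hypothetical log event: two facts about job-7, recorded in context run-42.
lines = log_facts("run-42", "job-7", {"status": "SUCCEEDED", "inputRecords": 1000})
```

The caller only supplies a subject and a dictionary of properties, so the triple and quad mechanics stay hidden from users, as the slide asks.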
Etosha Toolbox
CONCEPT
Data extractors,
Data transformers,
Ontology-based orchestration,
People and machines contribute facts,
Iterative approach with closed feedback loops,
Scalable environment …
INITIAL IMPLEMENTATION
Multi-layer metadata capturing
Operational metrics
Metrics about fast & static data
Business metrics
Contextualized presentation
Ad-hoc queries for exploration
Graph analytics
> Knowledge exposure
> Self-service DS and BI can speak the same language.
Results: Better Collaboration for (Hadoop) Knowledge Workers
• Our achievements:
• The open graph model is language-, OS-, and hardware-independent.
• Merging of knowledge partitions enables cluster-spanning metadata exploration.
• Query beans expose facts from multiple stores to web-based interfaces.
• Next steps:
• Improve implicit triplification (query the Solr index and get RDF data).
• Standardize the process and integrate with existing ontologies.
• Grow a community … and enter the Apache Incubator.
Results: Access Facts & Context of Critical Processes
DEMO: https://www.youtube.com/watch?v=ZE7Gcanv90s&feature=youtu.be
Thank you!
Many thanks to the Cloudera team which supported this work.
