IBM Analytics Platform Group
Enterprise Graph Analytics
Enterprise large-scale graph analytics and computing based on a distributed
graph database (Titan DB with HBase/Solr), distributed in-memory graph
computing (TinkerPop Hadoop-Gremlin SparkGraphComputer), and Hadoop 2
• Jun(Terry) Yang • yangjuncn@cn.ibm.com
• Jing Chen(Jerry) He • jinghe@us.ibm.com
• Hadoop Summit 2017 • June 15, 2017
2© IBM 2017 Hadoop Summit 2017
Agenda
• Challenges in hybrid data analytics
• Enterprise data quality analytics system based on graphed metadata
• Graph in enterprise data quality analytics solution
Hybrid data analytics and challenges
How was “total quantity” calculated? Show me the lineage.
What are the source-to-target mappings for the DW?
Who read the “sales” data outside working hours?
How do we ensure data quality?
Data Warehouse Architect
Auditor
Business Person
Data Architect
How to handle the challenges?
Data Governance
• Data Lifecycle Management
• Data Quality Management
− Correctness
− Consistency
− Completeness
− Timeliness
• Metadata
• …
• Master Data Management
• …
What is Metadata?
• The data used to describe other data
− Simple Metadata
− Rich Metadata
File system metadata
• inode attributes for file management
• Filesystem object attributes such as modify time, access, owner, and permission
DBMS/DW/NoSQL metadata
• Schema for data management
• Ownership information of data
• Server/database information of data
How can metadata be managed across servers, systems, and platforms?
Agenda
• Challenges in hybrid data analytics
• Enterprise data quality analytics system based on graphed metadata
• Graph in enterprise data quality analytics solution
Advantages of graphs in metadata management
Traditional solution
• Limited to one server/system
• Metadata managed within a single server/system
Property-graph-based solution
• Integrates metadata
• Handles storage pressure
• Efficient processing and querying
• Lineage
• Wide range managed
Property Graph
[Figure: a property graph G = (V, E), where G is the graph, V its vertices, and E its edges. Each vertex and each edge carries a label and key:value properties (Key1:value1, Key2:value2).]
• Born for relationship
• Intuitive modeling
• Expressive querying
• Native analysis
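The property graph model on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Titan's or TinkerPop's data model; the class names `Vertex`, `Edge`, and `PropertyGraph` and the sample properties are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: int  # id of the source vertex
    in_v: int   # id of the target vertex
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """G = (V, E): vertices and edges both carry a label and key:value properties."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = Vertex(vid, label, props)
        return self.vertices[vid]

    def add_edge(self, label, out_v, in_v, **props):
        edge = Edge(label, out_v, in_v, props)
        self.edges.append(edge)
        return edge

    def out_neighbors(self, vid, edge_label):
        """Vertices reached by following outgoing edges with the given label."""
        return [self.vertices[e.in_v] for e in self.edges
                if e.out_v == vid and e.label == edge_label]


# Hypothetical sample data: one user running one program at 22:00.
g = PropertyGraph()
g.add_vertex(1, "user", name="alice")
g.add_vertex(2, "program", name="report_job")
g.add_edge("run", 1, 2, ts_hour=22)
names = [v.properties["name"] for v in g.out_neighbors(1, "run")]
```

Because relationships are stored as first-class edges with their own properties, a question such as "what did this user run, and when?" becomes a direct lookup rather than a join.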
Using Graph Analytics to Find Complex Patterns
[Figure: a graph highlighting 1st-, 2nd-, and 3rd-degree relationships]
• Graph queries are a natural way to analyze relationship patterns
− Less complex than SQL
− Can handle high degrees of relationship with ease
• Graph schema facilitates visualization and exploration of relationships
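To make the "degrees of relationship" idea concrete, here is a small sketch that finds every vertex within N hops using breadth-first search. In a relational edge table each additional degree typically means another self-join or a recursive query, while in a graph it is one more hop of the same loop. The function name and the sample adjacency lists are hypothetical.

```python
from collections import deque


def degrees_of_relationship(adjacency, start, max_degree):
    """Breadth-first search returning {vertex: degree} for vertices within max_degree hops."""
    degree = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if degree[current] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in adjacency.get(current, ()):
            if neighbor not in degree:
                degree[neighbor] = degree[current] + 1
                queue.append(neighbor)
    del degree[start]
    return degree


# Hypothetical undirected relationships, stored as adjacency lists.
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
result = degrees_of_relationship(adjacency, "A", 3)
```

Here `result` maps B, C, and D to degrees 1, 2, and 3; E is four hops away and is left out.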
Case study - Audit data access
• Data theft risk in hybrid enterprise environments
– Most data is stolen by insiders.
– Most data theft happens outside working hours.
– Over-granting of privileges may lead to data theft.
Enterprise data quality analytics system based on graphed metadata
• Data ingest: finance data, consumption data, credit data, behavioral data, graphed metadata, …
• Feature Selection: statistical learning, data analysis, (graphed) metadata analysis, …
• Advanced Feature Selection: Gradient Boosting Decision Tree, Support Vector Machine, Random Forests, PageRank (graph), …
• Modeling: customer risk rating, consumption capacity, graph model, …
• Recommendation: consumer behavior, fraud detection, risk analytics (audit), …
Data ingest
Identify entities and relationships
• Relationships: user Run program, program Read Data
• user properties: id, name, group, permission, …
• program properties: name, job id, params, config, inputs, outputs, start_ts, finish_ts, …
• Data properties: name, size, location, department, permission, parent, children, …
• Edge properties (Run, Read): ts_hour, ts_min, ts_sec, status, …
Metadata to Graph (metadata integration, graph-based traversal)
• Entities → Vertices: User, Program, Data, …
• Relationships → Edges: User run program, Program read data, …
• Attributes → Properties: Name, …
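The mapping from metadata to a graph can be sketched as follows: entities become vertices, relationships become edges, and attributes become properties. The record layout and the function name `metadata_to_graph` are assumptions for illustration; the slide does not specify how the metadata is actually delivered.

```python
def metadata_to_graph(job_records):
    """Turn flat job-run metadata records into vertices and edges.

    Entities become vertices, relationships become edges, attributes become properties.
    """
    vertices = {}  # (label, name) -> properties
    edges = []     # (out_vertex, edge_label, in_vertex, properties)
    for rec in job_records:
        user = ("user", rec["user"])
        program = ("program", rec["program"])
        data = ("data", rec["data"])
        vertices.setdefault(user, {"name": rec["user"]})
        vertices.setdefault(program, {"name": rec["program"]})
        vertices.setdefault(data, {"name": rec["data"], "department": rec["department"]})
        # "user run program" and "program read data", with the hour as an edge property
        edges.append((user, "run", program, {"ts_hour": rec["ts_hour"]}))
        edges.append((program, "read", data, {"ts_hour": rec["ts_hour"]}))
    return vertices, edges


# Hypothetical audit records, one per program run.
records = [
    {"user": "alice", "program": "etl_job", "data": "sales_2017",
     "department": "sales", "ts_hour": 23},
    {"user": "bob", "program": "etl_job", "data": "sales_2017",
     "department": "sales", "ts_hour": 10},
]
vertices, edges = metadata_to_graph(records)
```

Shared entities such as `etl_job` and `sales_2017` are stored once and linked from both runs, which is what lets metadata from different systems be integrated in one graph.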
Feature Selection
Who read the sensitive sales data outside working hours?
Query: userFeaSele = graph.traversal().V().has("department","sales").inE("read").outV().hasLabel("program").inE("run").has("ts_hour",not(between(9,17))).outV()
Advanced Feature Selection
Which users have access to large amounts of data?
Query: … withComputer(SparkGraphComputer) …
userAdvFeaSele = userFeaSele.pageRank().by('pageRank').order().by('pageRank',decr).limit(30)
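The two selection steps can be imitated in plain Python to show what they compute. This is a sketch under stated assumptions, not the Gremlin execution path: the edge tuples, vertex names, and function names are hypothetical, working hours are taken as 9:00 to 17:00, and a single-process power iteration stands in for the distributed PageRank that SparkGraphComputer would run.

```python
def off_hours_readers(edges, vertex_props, department, start_hour=9, end_hour=17):
    """Users who ran, outside [start_hour, end_hour), a program that read the department's data."""
    programs = {out_v for out_v, label, in_v, _ in edges
                if label == "read" and vertex_props[in_v].get("department") == department}
    return sorted({out_v for out_v, label, in_v, props in edges
                   if label == "run" and in_v in programs
                   and not (start_hour <= props["ts_hour"] < end_hour)})


def page_rank(edges, damping=0.85, iterations=30):
    """Plain power-iteration PageRank over directed edges."""
    nodes = sorted({v for out_v, _, in_v, _ in edges for v in (out_v, in_v)})
    out_links = {n: [] for n in nodes}
    for out_v, _, in_v, _ in edges:
        out_links[out_v].append(in_v)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping) / len(nodes) + damping * dangling / len(nodes)
                    for n in nodes}
        for n in nodes:
            for target in out_links[n]:
                new_rank[target] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


# Hypothetical graph: users run programs, programs read data sets.
vertex_props = {
    "alice": {}, "bob": {}, "etl_job": {}, "crm_job": {},
    "sales_2017": {"department": "sales"}, "hr_2017": {"department": "hr"},
}
edges = [
    ("alice", "run", "etl_job", {"ts_hour": 23}),
    ("bob", "run", "etl_job", {"ts_hour": 10}),
    ("bob", "run", "crm_job", {"ts_hour": 2}),
    ("etl_job", "read", "sales_2017", {"ts_hour": 23}),
    ("crm_job", "read", "hr_2017", {"ts_hour": 2}),
]
suspects = off_hours_readers(edges, vertex_props, "sales")
ranks = page_rank(edges)
```

Only `alice` is selected: `bob` ran the sales-reading program during working hours, and his off-hours run touched HR data rather than sales data. Like the traversal on the slide, the filter looks at the time of the run edge only.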
Modeling
• Model risk analysis with the graphed metadata and information from the ERP system.
• Analyze each user against employee information from the ERP system, such as years of service, age, and role, to identify suspects. A non-sales person, for example an application R&D person, becomes a suspect.
• Audit recommendation.
Risk analysis model
• Inputs: Graph: user list (userAdvFeaSele, from Advanced Feature Selection); ERP: employee information; ERP: violation information; other systems
• Outputs: audit recommendation; risk analysis report; suspects who stole sensitive data
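As a rough illustration of the modeling step, the sketch below joins the user list from the graph with ERP records and ranks the users for audit. It is a simple rule-based stand-in, not the risk model used in the talk; the field names, the sample employees, and the function `audit_candidates` are all hypothetical.

```python
def audit_candidates(flagged_users, employees, violations, data_department="sales"):
    """Rank flagged users: outside the data's department first, then by prior violations."""
    report = []
    for user in flagged_users:
        info = employees.get(user)
        if info is None:
            continue  # no ERP record, so this user cannot be assessed here
        report.append({
            "user": user,
            "role": info["role"],
            "outside_department": info["department"] != data_department,
            "prior_violations": violations.get(user, 0),
        })
    report.sort(key=lambda r: (r["outside_department"], r["prior_violations"]),
                reverse=True)
    return report


# Hypothetical ERP extracts.
employees = {
    "alice": {"department": "r&d", "role": "application developer", "years": 3},
    "carol": {"department": "sales", "role": "account manager", "years": 8},
}
violations = {"alice": 1}
report = audit_candidates(["carol", "alice", "mallory"], employees, violations)
order = [row["user"] for row in report]
```

The non-sales user with a prior violation comes first, matching the slide's example that an application R&D person reading sales data is the more likely suspect.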
Agenda
• Challenges in hybrid data analytics
• Enterprise data quality analytics system based on graphed metadata
• Graph in enterprise data quality analytics solution
Graph in enterprise data quality analytics solution
• Data Source: user data, machine data, log data, behavioral data, graphed metadata, third-party data
• Ingest (load)
• Enterprise data quality system: feature analysis, lineage, metadata management, cleansing (on Hadoop, HBase, Hive, HDFS, Spark, Titan, Solr, …)
• Business Application: risk management, data audit, cost analytics, ……
How to choose an enterprise graph database?
Evaluate a graph database from the following perspectives:
• Data storing features
• Operation and manipulation features
• Graph data structures
• Query features
• Schema and instance representation
• Easy and centralized management
• Expose service
• Security features
• Fast computing
Titan
• What is Titan
− Distributed Graph Database
− Based on TinkerPop (Gremlin)
− Open Source
• Titan Features
− Distributed
− Scalable: billions of edges and vertices
− Real-time
− Transactional database (concurrent users/ACID/..)
− Global graph compute: graph data analytics, report, ETL
− Search: geo, numeric range, and full text search
Titan solution architecture
• Application
• Management API and TinkerPop API (Gremlin)
• Internal API layer
• Database layer (Tx, Data, Mgmt, Optimizer)
• Storage and Index Interface Layer, plus OLAP I/O Interface
• OLTP: HBase (storage backend) and Solr (external index backend)
• OLAP: Gremlin GraphComputer on Spark and Hadoop (big data platform)
• Optimized for storing and querying billions of vertices and edges over a cluster
• Supports thousands of concurrent users
• Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)
Backend – HBase & Solr
• HBase
− Tight integration with the Hadoop ecosystem.
− Native support for strong consistency.
− Linear scalability with the addition of more machines.
− Strictly consistent reads and writes.
− Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
− Support for exporting metrics via JMX.
− Open source under the liberal Apache 2 license.
• Solr
− Solr is the popular, blazing fast open source enterprise search platform from the
Apache Lucene project.
− Solr is a standalone enterprise search server with a REST-like API.
− Solr is highly reliable, scalable and fault tolerant, providing distributed indexing,
replication and load-balanced querying, automated failover and recovery, centralized
configuration and more.
✓ Data storing features
✓ Operation and manipulation features
✓ Graph data structures
✓ Query features
✓ Schema and instance representation
Easy and centralized Management
Expose service
Security features
Fast computing
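For orientation, pointing Titan at an HBase storage backend and a Solr index backend comes down to a handful of properties. The sketch below renders such a file from Python. The key names follow the Titan 1.0 configuration reference as best recalled and should be checked against the release in use; the host names and table name are placeholders, and `render_properties` is a hypothetical helper.

```python
def render_properties(settings):
    """Render a dict as Java-style .properties text, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Placeholder values; key names are assumed from the Titan 1.0 configuration reference.
titan_settings = {
    "storage.backend": "hbase",
    "storage.hostname": "zk1.example.com",       # ZooKeeper quorum used by HBase
    "storage.hbase.table": "titan",
    "index.search.backend": "solr",
    "index.search.solr.mode": "cloud",
    "index.search.solr.zookeeper-url": "zk1.example.com:2181/solr",
}
properties_text = render_properties(titan_settings)
```

The `search` segment in the index keys is the index name; graph indexes that use Solr refer to the backend by that name.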
Integration and management
Titan in Ambari
• Titan deployment: installation, uninstallation, Titan client deployment, Titan server deployment
• Titan server operation: start server, stop server, service check
• Titan configuration: HBase backend, Solr backend, SparkGraphComputer, Titan server, Titan environment, Titan security
• Titan security support: SSL, SASL, LDAP, Kerberos, Knox, HBase access control
✓ Data storing features
✓ Operation and manipulation features
✓ Graph data structures
✓ Query features
✓ Schema and instance representation
✓ Easy and centralized Management
Expose service
Security features
Fast computing
Remote Titan service
[Figure: Titan server and Titan client. Gremlin Server ({RESTful}, {Web Socket}) and Gremlin Console (gremlin>) sit on top of the Titan Engine (Mgmt API, TP API - Gremlin, Internal API layer, Database layer, OLAP I/O, Storage and Index Interface Layer) with HBase, Solr, and the Spark Gremlin GraphComputer; access is remote or local.]
✓ Data storing features
✓ Operation and manipulation features
✓ Graph data structures
✓ Query features
✓ Schema and instance representation
✓ Easy and centralized Management
✓ Expose service
Security features
Fast computing
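As an illustration of the RESTful path, a client can submit a Gremlin script to Gremlin Server as a JSON body of the form {"gremlin": "<script>"} in an HTTP POST. The sketch below only builds the request and does not send it. It assumes the server has its HTTP endpoint enabled and listens on the default port 8182; the host, the example query, and the helper name are placeholders.

```python
import json
import urllib.request


def build_gremlin_request(script, host="localhost", port=8182):
    """Build (without sending) an HTTP POST that submits a Gremlin script."""
    body = json.dumps({"gremlin": script}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_gremlin_request("g.V().has('department', 'sales').count()")
payload = json.loads(request.data.decode("utf-8"))
# To send it against a running server: urllib.request.urlopen(request).read()
```

On a secured deployment the request would also need TLS and credentials, which this sketch leaves out.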
Titan security enhancement
[Figure: the remote and local Titan client and the Titan server (Mgmt API, TP API - Gremlin, Internal API layer, Database layer, OLAP I/O Interface, Storage and Index Interface Layer, HBase, Solr, Spark Gremlin GraphComputer on the cluster), annotated with security mechanisms]
• SSL
• Knox
• SASL
• LDAP/OS/Kerberized Titan user
• HBase access control
• Kerberized cluster
✓ Data storing features
✓ Operation and manipulation features
✓ Graph data structures
✓ Query features
✓ Schema and instance representation
✓ Easy and centralized Management
✓ Expose service
✓ Security features
Fast computing
Integrate TinkerPop SparkGraphComputer with Titan DB
[Figure: the Titan layers (Mgmt API, TP API - Gremlin, Internal API layer, Database layer, OLAP I/O Interface, Storage and Index Interface Layer, HBase, Solr) connected through Hadoop-Gremlin and Spark-Gremlin to SparkGraphComputer, a Gremlin GraphComputer running on Spark over a Graph RDD]
Vertex programs:
• PageRankVertexProgram
• PeerPressureVertexProgram
• BulkDumperVertexProgram
• BulkLoaderVertexProgram
• TraversalVertexProgram
✓ Data storing features
✓ Operation and manipulation features
✓ Graph data structures
✓ Query features
✓ Schema and instance representation
✓ Easy and centralized Management
✓ Expose service
✓ Security features
✓ Fast computing
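Vertex programs such as those listed on this slide share a vertex-centric pattern: in each superstep every vertex sends messages along its edges and then updates its own state from the messages it received. The sketch below shows that pattern with a simplified cluster-voting rule loosely modeled on peer pressure. It is not TinkerPop's PeerPressureVertexProgram, it runs in one process rather than on Spark, and the function name, tie-breaking rule, and sample graph are assumptions.

```python
from collections import Counter


def run_vertex_program(adjacency, max_supersteps=20):
    """Synchronous vertex-centric cluster voting.

    Each superstep, every vertex sends its current cluster to its neighbors and then
    adopts the cluster it received most often (the smallest cluster id wins ties).
    """
    cluster = {v: v for v in adjacency}
    for _ in range(max_supersteps):
        # Message phase: collect the clusters each vertex hears from its neighbors.
        inbox = {v: [] for v in adjacency}
        for v, neighbors in adjacency.items():
            for n in neighbors:
                inbox[n].append(cluster[v])
        # Compute phase: every vertex updates from its own inbox only.
        updated = {}
        for v, messages in inbox.items():
            if not messages:
                updated[v] = cluster[v]
                continue
            votes = Counter(messages)
            top = max(votes.values())
            updated[v] = min(c for c, count in votes.items() if count == top)
        if updated == cluster:
            break  # no vertex changed its cluster, so the program has converged
        cluster = updated
    return cluster


# Hypothetical graph: two separate triangles.
adjacency = {
    1: [2, 3], 2: [1, 3], 3: [1, 2],
    4: [5, 6], 5: [4, 6], 6: [4, 5],
}
clusters = run_vertex_program(adjacency)
```

Because each vertex reads only its own inbox, the compute phase can be partitioned across workers, which is the property a distributed graph computer relies on. The superstep cap guards against graphs where synchronous voting oscillates instead of converging.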
Open source Graph Database
JanusGraph: a new Linux Foundation project formed to continue development of
the Titan graph database.
http://janusgraph.org
The last Titan release, 1.0.0, was on Sep 20, 2015.
References & Contacts
• IBM Open Platform for Apache Hadoop and Apache Spark
− https://www.ibm.com/us-en/marketplace/ibm-open-platform
− https://www.ibm.com/support/knowledgecenter/SSPT3X/SSPT3X_welcome.html
• Graph-related
− Titan: http://titan.thinkaurelius.com
− JanusGraph: http://janusgraph.org
− TinkerPop: https://tinkerpop.apache.org
Jun(Terry) Yang
yangjuncn@cn.ibm.com
Linkedin.com/in/terryjunyang
Jing Chen(Jerry) He
jinghe@us.ibm.com
Linkedin.com/in/jing-chen-jerry-he-1553511
Thanks!
Questions?
