Scott Miao
2013/12/14
Who am I
• RD, SPN, Trend Micro
• 3 years working with the Hadoop ecosystem
• Expertise in HDFS/MR/HBase
• @takeshi.miao
THREATCONNECT
[Diagram labels: Product 1, Product 2, Product 3, …; IP, domain, URL, filename, process, file hash, virus detection, registry key, etc.; Sandbox, APT KB, Threat Connect, Virus DB, TE, Family Writeup, File Detection, Threat Web, Web Reputation]
• Processes and correlates different data sources
• Most relevant threat report with actionable intelligence on a single portal
A GRAPH
The problems
• Store large graph data
• Access large graph data
• Process large graph data
Big Data
STORE
Property Graph Model (1/3)

https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model
Property Graph Model (2/3)
• A property graph has these elements
– a set of vertices
• each vertex has a unique identifier.
• each vertex has a set of outgoing edges.
• each vertex has a set of incoming edges.
• each vertex has a collection of properties defined by a map from key to value.

– a set of edges
• each edge has a unique identifier.
• each edge has an outgoing tail vertex.
• each edge has an incoming head vertex.
• each edge has a label that denotes the type of relationship between its two vertices.
• each edge has a collection of properties defined by a map from key to value.
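
The element list above maps naturally onto two small classes. The following is a minimal in-memory sketch of that model; the class names `PropertyGraphSketch`, `Vertex`, `Edge` and the helper `connect` are hypothetical and are not the Blueprints API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory property graph; class and method names are
// illustrative and are NOT the Blueprints API.
final class PropertyGraphSketch {

    static final class Vertex {
        final String id;                                        // unique identifier
        final List<Edge> outEdges = new ArrayList<>();          // outgoing edges
        final List<Edge> inEdges = new ArrayList<>();           // incoming edges
        final Map<String, Object> properties = new HashMap<>(); // key -> value

        Vertex(String id) {
            this.id = id;
        }
    }

    static final class Edge {
        final String id;                                        // unique identifier
        final Vertex tail;                                      // vertex the edge leaves
        final Vertex head;                                      // vertex the edge enters
        final String label;                                     // type of relationship
        final Map<String, Object> properties = new HashMap<>(); // key -> value

        Edge(String id, Vertex tail, String label, Vertex head) {
            this.id = id;
            this.tail = tail;
            this.label = label;
            this.head = head;
        }
    }

    // Creates an edge and registers it on both of its endpoints.
    static Edge connect(String edgeId, Vertex tail, String label, Vertex head) {
        Edge edge = new Edge(edgeId, tail, label, head);
        tail.outEdges.add(edge);
        head.inEdges.add(edge);
        return edge;
    }

    public static void main(String[] args) {
        Vertex domain = new Vertex("v1");
        domain.properties.put("type", "domain");
        Vertex url = new Vertex("v2");
        Edge host = connect("e1", domain, "host", url);
        System.out.println(domain.id + " --" + host.label + "--> " + host.head.id);
    }
}
```

Keeping both `outEdges` and `inEdges` on each vertex costs extra memory but lets a traversal run in either direction without scanning every edge.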
Property Graph Model (3/3)
The domain model for
Property Graph Model
The relational model for
Property Graph Model
Massively scalable?

Active community?

Analyzable?
The winner is…
• We use HBase as the graph storage
– Google BigTable and PageRank
– HBaseCon2012

Yeah, we are No. 1!!
Use HBase to store Graph data (1/3)
• Schema design (each cell is written as row key, column, value)
– Table: vertex
‘<vertex-id>@<entity-type>’, ‘property:<property-key>@<property-value-type>’, <property-value>

– Table: edge
‘<vertex1-row-key>--><label>--><vertex2-row-key>’, ‘property:<property-key>@<property-value-type>’, <property-value>
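
The schema above is plain string concatenation, so building and parsing keys can be sketched in a few lines. This is only an illustration: the class `RowKeySketch` and its methods are hypothetical, the URL in `main` is a placeholder, and the parser assumes that neither the tail vertex key nor the label contains the `-->` delimiter.

```java
import java.util.regex.Pattern;

// Hypothetical helpers for the row key layout described above; not HGraph code.
final class RowKeySketch {

    static final String EDGE_DELIMITER = "-->";

    // '<vertex-id>@<entity-type>'
    static String vertexRowKey(String vertexId, String entityType) {
        return vertexId + "@" + entityType;
    }

    // 'property:<property-key>@<property-value-type>'
    static String propertyColumn(String propertyKey, String valueType) {
        return "property:" + propertyKey + "@" + valueType;
    }

    // '<vertex1-row-key>--><label>--><vertex2-row-key>'
    static String edgeRowKey(String tailRowKey, String label, String headRowKey) {
        return tailRowKey + EDGE_DELIMITER + label + EDGE_DELIMITER + headRowKey;
    }

    // Splits an edge row key into {tail row key, label, head row key}.
    // Assumption: the tail key and the label never contain "-->".
    static String[] parseEdgeRowKey(String edgeRowKey) {
        String[] parts = edgeRowKey.split(Pattern.quote(EDGE_DELIMITER), 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not an edge row key: " + edgeRowKey);
        }
        return parts;
    }

    public static void main(String[] args) {
        String tail = vertexRowKey("myapps-ups.com", "domain");
        String head = vertexRowKey("http://example.com/a.exe", "url"); // placeholder URL
        String edge = edgeRowKey(tail, "host", head);
        System.out.println(edge);
        System.out.println(propertyColumn("ip", "String"));
    }
}
```

Because the tail vertex key comes first in the edge row key, all outgoing edges of one vertex sort next to each other in the edge table.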
Use HBase to store Graph data (2/3)
• Sample
– Table: vertex
‘myapps-ups.com@domain’, ‘property:ip@String’, ‘…’
‘myapps-ups.com@domain’, ‘property:asn@String’, ‘…’
…
‘http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:path@String’, ‘…’
‘http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:parameter@String’, ‘…’

– Table: edge
‘myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’,
‘property:property1’, ‘…’
‘myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’,
‘property:property2’, ‘…’
Use HBase to store Graph data (3/3)
• Tables (HBase shell; a TTL of 7776000 seconds is 90 days)
– create 'test.vertex', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
– create 'test.edge', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
It’s not me,
actually…

ACCESS
[Diagram: HBase at the center, with three flows]
1. Put data (Data Sources)
2. Get data (Clients)
3. Process data (Algorithms)
Put Data
• The HBase schema design is simple and human-readable
• It is easy to write your own dumping tool as needed
– MR/Pig/Completebulkload
– You can write a cron job to clean up broken-edge data
– TTL can also help to retire old data

• We already have a lot of practice with this task
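
A cleanup job of the kind mentioned above could start from a check like the following sketch. It assumes that a "broken edge" means an edge row whose tail or head vertex row no longer exists (the slides do not define the term), and it works on in-memory collections rather than on HBase tables; `BrokenEdgeSketch` and `findBrokenEdges` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical broken-edge check; a real job would read both tables from HBase.
final class BrokenEdgeSketch {

    // Returns the edge row keys whose tail or head vertex is missing.
    // Assumption: edge keys look like '<tail-key>--><label>--><head-key>'.
    static List<String> findBrokenEdges(Set<String> vertexRowKeys, List<String> edgeRowKeys) {
        List<String> broken = new ArrayList<>();
        for (String edgeKey : edgeRowKeys) {
            String[] parts = edgeKey.split(Pattern.quote("-->"), 3);
            boolean malformed = parts.length != 3;
            if (malformed
                    || !vertexRowKeys.contains(parts[0])
                    || !vertexRowKeys.contains(parts[2])) {
                broken.add(edgeKey);
            }
        }
        return broken;
    }

    public static void main(String[] args) {
        Set<String> vertices = new HashSet<>(Arrays.asList("a@domain", "u1@url"));
        List<String> edges = Arrays.asList(
                "a@domain-->host-->u1@url",   // both endpoints exist
                "a@domain-->host-->u2@url");  // head vertex is missing
        System.out.println(findBrokenEdges(vertices, edges));
    }
}
```

With a TTL on both tables, a vertex row can expire before the edges that point to it, which is one way such edges could appear.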
Get Data (1/2)
• A Graph API
• Better semantics for manipulating graph data
– A wrapper around the HBase Client API
– Rather than using the HBase Client API directly

• Simple to use
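import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;

// Fetch a vertex by id, then walk its outgoing edges that carry any of the given labels.
Vertex vertex = this.graph.getVertex("40012");
Vertex subVertex = null;
Iterable<Edge> edges =
    vertex.getEdges(Direction.OUT, "knows", "foo", "bar");
for (Edge edge : edges) {
    // In Blueprints, Direction.IN on an edge is its head vertex, i.e. the neighbor
    // an outgoing edge points to; Direction.OUT would return `vertex` itself.
    subVertex = edge.getVertex(Direction.IN);
    ...
}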
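
One way a wrapper could answer an outgoing-edge query with the edge row key layout is a prefix scan, since HBase keeps rows sorted by row key. The sketch below imitates that with a sorted in-memory set; it is an assumption about how such a lookup might work, not HGraph's actual implementation, and `PrefixScanSketch` and `outEdges` are hypothetical names. Java compares strings by UTF-16 code unit while HBase compares bytes, so the orderings agree only for ASCII keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical prefix scan over sorted edge row keys; a TreeSet stands in for
// the sorted HBase edge table.
final class PrefixScanSketch {

    // Returns every edge row key that starts with '<vertex-key>--><label>-->'.
    static List<String> outEdges(NavigableSet<String> edgeTable, String vertexRowKey, String label) {
        String prefix = vertexRowKey + "-->" + label + "-->";
        List<String> result = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, so start at the
        // prefix and stop at the first key that no longer matches.
        for (String rowKey : edgeTable.tailSet(prefix, true)) {
            if (!rowKey.startsWith(prefix)) {
                break;
            }
            result.add(rowKey);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> edgeTable = new TreeSet<>(Arrays.asList(
                "a@domain-->host-->u1@url",
                "a@domain-->host-->u2@url",
                "a@domain-->resolve-->1.2.3.4@ip",
                "b@domain-->host-->u3@url"));
        for (String rowKey : outEdges(edgeTable, "a@domain", "host")) {
            System.out.println(rowKey);
        }
    }
}
```

Incoming edges do not share a prefix under this key layout, so a query in the other direction would need a full scan or a second, reversed index.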
Get Data (2/2)
• We implement the Blueprints API
– It provides interfaces as a spec for users to implement
– Currently the basic query methods are implemented
– We can benefit from it
• Other libraries will support us if we implement more of the Blueprints API
– http://www.tinkerpop.com/
– RESTful server, graph algorithms, dataflow, etc.
Attack on graph
PROCESS
• Thanks to the human-readable HBase schema design and HBase's natural random access, you can
– Write your own MR
– Write your own Pig/UDFs

• Example: PageRank
– http://zh.wikipedia.org/wiki/Pagerank
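
To make the PageRank example concrete, here is a small single-machine sketch of the iterative computation. It is not the MapReduce job from the talk: the class `PageRankSketch`, the adjacency-map input, and the even redistribution of rank from vertices without outgoing edges are all assumptions made for illustration. Every vertex must appear as a key in the map, even when it has no outgoing edges.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory PageRank; input maps each vertex to its out-neighbors.
final class PageRankSketch {

    static Map<String, Double> pageRank(Map<String, List<String>> outLinks,
                                        double damping, int iterations) {
        int n = outLinks.size();
        Map<String, Double> rank = new HashMap<>();
        for (String v : outLinks.keySet()) {
            rank.put(v, 1.0 / n); // start from a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String v : outLinks.keySet()) {
                next.put(v, (1.0 - damping) / n); // random-jump term
            }
            double danglingMass = 0.0;
            for (Map.Entry<String, List<String>> entry : outLinks.entrySet()) {
                double r = rank.get(entry.getKey());
                List<String> targets = entry.getValue();
                if (targets.isEmpty()) {
                    danglingMass += r; // no out-edges: redistribute below
                    continue;
                }
                double share = damping * r / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            double danglingShare = damping * danglingMass / n;
            for (String v : outLinks.keySet()) {
                next.merge(v, danglingShare, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("c"));
        graph.put("c", Arrays.asList("a"));
        graph.put("d", Collections.<String>emptyList());
        System.out.println(pageRank(graph, 0.85, 30));
    }
}
```

In an MR version, each iteration of the outer loop would typically become one job, with the map phase emitting rank shares and the reduce phase summing them per vertex.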
HGraph
• The project is open source and hosted on GitHub
– https://github.com/takeshimiao/HGraph

• A partial implementation released from our internal pilot project
– Follows the HBase schema design
– Reads data via the Blueprints API
– Processes data with PageRank

• Download or ‘git clone’ it
– Use ‘mvn clean package’
– Run on a Unix-like OS
• Running on Windows may cause some errors
There is another project

http://thinkaurelius.github.io/faunus/

http://thinkaurelius.github.io/titan/
OBSERVATIONS
YARN
• It seems to be turning Hadoop into the de facto big data platform
– It loosens the binding to the MR framework and accommodates other frameworks

• A bunch of data processing frameworks are migrating to it
http://hortonworks.com/hadoop/yarn/
SQL-on-Hadoop
• Impala vs. Hive (Stinger and Tez)
– Impala seems more mature than Hive
Hive is built on top of a batch processing framework (even MRv2), but Impala goes its own way!!
Todd Lipcon
Committer/PMC member on Apache Thrift, HBase, and Hadoop projects

• YARN!!
– Hive Stinger and Tez are based on YARN (HDP2)
– Impala also plans to migrate to YARN (CDH5)
– Even HBase!! (HOYA)
HBase is a popular NoSQL
• From what I saw in Europe/CA/China, HBase is the most popular NoSQL solution if you have already adopted Hadoop
• Other NoSQL stores will not relieve your ops pain points
• So the best way is to pick the right tool for you and use it well
http://www.slideshare.net/Hadoop_Summit/what-is-the-point-of-hadoop?from_search=1 #p34
