8. Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers
• JST (Japan Science and Technology Agency) CREST (Core Research for Evolutional Science and Technology) Project (Oct. 2011 - March 2017)
• 4 groups, over 60 members
1. Fujisawa-G (Kyushu University): Large-scale Mathematical Optimization
2. Suzumura-G (University College Dublin, Ireland): Large-scale Graph Processing
3. Sato-G (Tokyo Institute of Technology): Hierarchical Graph Store System
4. Wakita-G (Tokyo Institute of Technology): Graph Visualization
• Innovative algorithms and implementations
• Optimization, searching, clustering, network flow, etc.
• Extreme Big Graph Data for emerging applications
• 2^30 ~ 2^42 nodes and 2^40 ~ 2^46 edges (see the sketch after this slide)
• Over 1M threads are required for real-time analysis
• Many applications on post peta-scale supercomputers
• Analyzing massive cyber-security and social networks
• Optimizing smart grid networks
• Health care and medical science
• Understanding complex life systems
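To put these scales in perspective, here is a back-of-the-envelope sketch of the raw edge-list footprint in Python (the 8-byte-vertex-ID, i.e. 16-bytes-per-edge, assumption is illustrative, not a figure from the project):

# Back-of-the-envelope memory footprint for the target graph scales.
# Assumption (illustrative): 8-byte vertex IDs -> 16 bytes per stored edge.
for log_edges in (40, 46):
    edges = 2 ** log_edges
    tib = edges * 16 / 2 ** 40
    print(f"2^{log_edges} edges -> {tib:,.0f} TiB")
# 2^40 edges -> 16 TiB; 2^46 edges -> 1,024 TiB (1 PiB):
# far beyond single-node DRAM, hence NVM devices and >1M threads.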
28. The 2nd Green Graph500 list on Nov. 2013
• Measures power efficiency using the TEPS/W ratio
• Results on various systems, such as Huawei's RH5885v2 w/ Tecal ES3000 PCIe SSD 800GB x 2 and 1.2TB x 2
• http://green.graph500.org
30. Tokyo Institute of Technology's GraphCREST-Custom #1 is ranked No. 3 in the Big Data category of the Green Graph 500 Ranking of Supercomputers, with 35.21 MTEPS/W on Scale 31, on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair.
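For reference, the Green Graph 500 metric divides the benchmark's traversal rate by measured power draw. A minimal sketch of the arithmetic in Python (all numbers below are hypothetical, not measured results):

def teps(edges_traversed, bfs_time_s):
    """Traversed edges per second (TEPS) for one BFS run."""
    return edges_traversed / bfs_time_s

# A Scale-31 Graph500 problem has 2^31 vertices and roughly 2^35 edges
# (edgefactor 16). The BFS time and power are made-up example values.
perf = teps(2 ** 35, bfs_time_s=2.0)
power_w = 500.0
print(f"{perf / 1e6 / power_w:.2f} MTEPS/W")  # -> 34.36 MTEPS/W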
31. Lessons from our Graph500 activities
• We can efficiently process large-scale data that exceeds the DRAM capacity of a compute node by utilizing commodity-based NVM devices
• Convergence of practical algorithms and software implementation techniques is very important
• Basically, BigData consists of sets of sparse data. Converting sparse datasets into dense ones is also key to efficient BigData processing (see the sketch below)
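As an illustration of the sparse-to-dense point in the last lesson, the sketch below converts a sparse edge list into CSR (compressed sparse row), packing each vertex's neighbor list into a dense, contiguous slice; this is a generic technique shown for illustration, not the project's actual code:

import numpy as np

def edges_to_csr(src, dst, num_vertices):
    """Pack a sparse edge list into dense CSR arrays.

    Neighbor lists become contiguous slices of `adj`, so traversals read
    memory sequentially -- the access pattern that keeps out-of-core
    processing on NVM devices efficient.
    """
    order = np.argsort(src, kind="stable")   # group edges by source vertex
    adj = np.asarray(dst)[order]
    counts = np.bincount(src, minlength=num_vertices)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    return offsets, adj

# Neighbors of vertex v are adj[offsets[v]:offsets[v+1]].
offsets, adj = edges_to_csr([0, 2, 0, 1], [1, 0, 2, 2], num_vertices=3)
print(offsets, adj)  # [0 2 3 4] [1 2 2 0]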
33. Hamar Overview
[Figure: Hamar's distributed-array data flow. Each rank (Rank 0, Rank 1, ..., Rank n) holds local arrays that together form a distributed array; computation proceeds as Map → Shuffle (data transfer between ranks) → Reduce. Local arrays can reside on NVM as virtualized local array objects, and data moves between host (CPU) and device (GPU) memory via memcpy (H2D, D2H).]
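To make the figure concrete, here is a toy, single-process model of that data flow: each rank maps over its local partition of the distributed array, intermediate key-value pairs are shuffled to their owner rank, and each rank reduces what it received. Everything here, function names included, is a simplified stand-in for the real multi-node, GPU- and NVM-backed framework:

from collections import defaultdict

def map_shuffle_reduce(local_arrays, map_fn, reduce_fn, num_ranks):
    """Toy model of Hamar's Map -> Shuffle -> Reduce over per-rank arrays."""
    # Map: each rank emits (key, value) pairs from its local array.
    emitted = [[kv for item in arr for kv in map_fn(item)]
               for arr in local_arrays]
    # Shuffle: route each pair to the rank owning its key
    # (the "data transfer between ranks" arrows in the figure).
    buckets = [defaultdict(list) for _ in range(num_ranks)]
    for pairs in emitted:
        for key, value in pairs:
            buckets[key % num_ranks][key].append(value)
    # Reduce: each rank folds the values gathered for its keys.
    return [{k: reduce_fn(vs) for k, vs in b.items()} for b in buckets]

# Word-count-style usage across 2 ranks:
print(map_shuffle_reduce([[1, 2, 2], [2, 3]],
                         map_fn=lambda x: [(x, 1)],
                         reduce_fn=sum, num_ranks=2))
# -> [{2: 3}, {1: 1, 3: 1}]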
34. Application Example: GIM-V (Generalized Iterative Matrix-Vector multiplication)*1
• Easy description of various graph algorithms by implementing the combine2, combineAll, and assign functions
• PageRank, Random Walk with Restart, Connected Components
– v' = M ×_G v, where v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_{i,j}, v_j)})) (i = 1..n)
– Iterative 2-phase MapReduce operations; straightforward implementation using Hamar
[Figure: two-stage MapReduce data flow for v' = M ×_G v on Hamar: combine2 (stage 1), combineAll and assign (stage 2)]
*1: Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining, 2009
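As a worked example, the sketch below specializes GIM-V to PageRank using one reasonable factoring of the three hooks (an in-memory, single-threaded toy; on Hamar the same hooks would run as the two MapReduce stages shown above):

def gimv_pagerank(edges, n, d=0.85, iters=30):
    """PageRank as GIM-V: v' = M x_G v via combine2 / combineAll / assign."""
    combine2 = lambda m_ij, v_j: m_ij * v_j        # stage 1 (Map + Shuffle)
    combine_all = lambda xs: d * sum(xs)           # stage 2 (Reduce)
    assign = lambda v_i, new: (1 - d) / n + new    # stage 2

    out_deg = [0] * n
    for s, t in edges:
        out_deg[s] += 1
    v = [1.0 / n] * n
    for _ in range(iters):
        partial = [[] for _ in range(n)]
        for s, t in edges:     # column-normalized M: m_{t,s} = 1/out_deg[s]
            partial[t].append(combine2(1.0 / out_deg[s], v[s]))
        v = [assign(v[i], combine_all(partial[i])) for i in range(n)]
    return v

print(gimv_pagerank([(0, 1), (1, 2), (2, 0), (0, 2)], n=3))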
35. Weak scaling on TSUBAME2.5 [Shirahata, Sato et al. Cluster2014]
• PageRank application
• Targets graphs that exceed GPU memory capacity (RMAT graphs)
[Figure: performance [MEdges/sec] vs. number of compute nodes, SCALE 23-24 per node; configurations: 1 CPU (S23 per node), 1 GPU (S23 per node), 2 CPUs (S24 per node), 2 GPUs (S24 per node), 3 GPUs (S24 per node)]
• 2.81 GE/s on 3072 GPUs (SCALE 34)
• 2.10x speedup (3 GPUs vs. 2 CPUs)
36. I/O system designs considering GPU accelerators and non-volatile memory [Shirahata, Sato et al. HPC141]
Design of a prototype machine using 16 mSATA SSDs:
• Capacity: 256GB x 16 → 4TB
• Read bandwidth: 0.5GB/s x 16 → 8GB/s
[Figure: a single mSATA SSD; 8 integrated mSATA SSDs; RAID cards; the prototype/test machine]
3.2 Burst Buffer System
To solve the problems in a flat buffer system, we consider a burst buffer system [21]. A burst buffer is a storage space that bridges the gap in latency and bandwidth between node-local storage and the PFS, and is shared by a subset of compute nodes. Although additional nodes are required, a burst buffer can offer a system many advantages over a flat buffer system, including higher reliability and efficiency. A burst buffer system is more reliable for checkpointing because burst buffers are located on a smaller number of dedicated I/O nodes, so the probability of lost checkpoints is decreased. In addition, even if a large number of compute nodes fail concurrently, an application can still access the checkpoints from the burst buffer. A burst buffer system provides more efficient utilization of storage resources for the partial restart of uncoordinated checkpointing because the processes involved in the restart can exploit higher storage bandwidth. For example, if compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize all SSD bandwidth, unlike in a flat buffer system. This capability accelerates the partial restart of uncoordinated checkpoint/restart.
Table 1: Node specification
CPU: Intel Core i7-3770K (3.50GHz x 4 cores)
Memory: Cetus DDR3-1600 (16GB)
M/B: GIGABYTE GA-Z77X-UD5H
SSD: Crucial m4 mSATA 256GB CT256M4SSD3 (peak read: 500MB/s, peak write: 260MB/s)
SATA converter: KOUTECH IO-ASS110 mSATA to 2.5" SATA device converter with metal frame
RAID card: Adaptec RAID 7805Q ASR-7805Q Single
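The prototype's numbers make the partial-restart argument easy to quantify. A toy bandwidth model in Python (the cluster sizes are illustrative; only the 0.5GB/s-per-SSD and 16-SSD figures come from the prototype above):

ssd_read_gbps = 0.5     # peak read of one mSATA SSD (from Table 1)
num_ssds = 16
restarting_nodes = 2    # illustrative partial restart after a failure

# Flat buffer: each restarting node reads only its own local SSD.
flat_bw = restarting_nodes * ssd_read_gbps
# Burst buffer: the shared buffer aggregates the whole SSD array,
# so even a partial restart can draw on the full 8GB/s.
burst_bw = num_ssds * ssd_read_gbps
print(f"flat: {flat_bw:.1f} GB/s, burst: {burst_bw:.1f} GB/s")  # 1.0 vs 8.0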
38. JST-CREST Extreme Big Data 2013-2018 (PI: Matsuoka)
[Figure: EBD overview. Future non-silo Extreme Big Data apps (Large Scale Metagenomics, e.g. "A Multi-GPU Read Alignment Algorithm", Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH); Massive Sensors and Data Assimilation in Weather Prediction; Ultra Large Scale Graphs and Social Infrastructures) are co-designed with EBD system software incl. the EBD object system (EBD Bag, Cartesian Plane KVS, EBD KVS, Graph Store) on a convergent architecture (Phases 1~4) with large-capacity NVM and a high-bisection network, converging batch-oriented-compute supercomputers with cloud IDC (today: very low BW efficiency). Node targets: DRAM + NVM/Flash on a TSV interposer with 2Tbps HBM and 4~6 HBM channels, 1.5TB/s DRAM and NVM bandwidth, 30PB/s I/O bandwidth possible (1 Yottabyte/year); low-power CPUs alongside a high-powered main CPU on the PCB.]
39. Tasks
[Figure: EBD "Convergent" System Overview, mapping the project tasks onto the stack (goal: a 100,000-fold improvement). Tasks 5-1~5-3: EBD Application Co-Design and Validation (Large Scale Genomic Correlation; Data Assimilation in Large Scale Sensors and Exascale Atmospherics; Large Scale Graphs and Social Infrastructure Apps). Task 6: EBD Performance Modeling and Evaluation. Task 4: Graph Store. Task 3: EBD Programming System (EBD Bag, EBD KVS, Cartesian Plane KVS). Task 2: EBD Distributed Object Store on 100,000 NVM Extreme Compute and Data Nodes, with Real-Time Resource Scheduling. Task 1: EBD "Convergent" Supercomputer (ultra-high-BW / low-latency NVM and network, processor-in-memory / 3D stacking, ultra-parallel low-power I/O; ~10TB/s → ~100TB/s → ~10PB/s), on TSUBAME 2.5/KFC → TSUBAME 3.0.]