An introduction and
evaluations of
a wide area distributed
storage system
DR: Disaster Recovery
1978
Sun Information Systems
mainframe
hot site
80s
Realtime
Processing
POS: point of sale
90s
the Internet
2001.9.11
September 11 attacks
2003.8.14
Northeast blackout of 2003
in Japan
2011.3.11 The aftermath of the 2011
Tohoku earthquake and tsunami
BCP: Business Continuity Plan
Eurasian
plate
North
American
Plate
Pacific
Ocean
Plate
Philippine Sea
Plate
epicenter
of 3.11
Nankai
(South Sea)
Trough
[NEXT]
Gunma
Gunma
Gunma
Ishikari
Ishikari
Is Two
enough ?
cost
National Institute of
Informatics
Trans-Japan
Inter-Cloud
Testbed
Kitami Institute
of Technology
University of the
Ryukyus
SINET
the longest
path
Cybermedia Center
Osaka University
Kitami Institute
of Technology
University of the
Ryukyus
XenServer
6.0.2
CloudStack
4.0.0
XenServer
6.0.2
CloudStack
4.0.0
problems
shared
storage
≒50ms
RTT
> 200ms
Storage XenMotion
Live Migration
without shared storage
> XenServer 6.1
VSA: vSphere Storage Appliance
VMware
VSAN
WIDE cloud
different translate
Distributed
Storage
requirement
[Figure: throughput (Kbytes/sec, 0 to 120000) plotted over file size (2^n KBytes) and record size (2^n KBytes, 4 to 16384)]
High
Random R/W
Performance
POSIX file system
interface protocol
NFS, CIFS, iSCSI
RICC: Regional InterCloud Committee
Distcloud: widely distributed virtualization
infrastructure
Confidential
Global VM migration is also possible when the VM host machines share the same "storage space".
Real-time availability of the data makes this possible; the actual data copy follows afterwards.
(The VM operator needs a virtually common Ethernet segment and a fat pipe for the memory copy.)
TOYAMA site
OSAKA site
TOKYO site
before Migration
Copy to DR-sites
Copy to DR-sites
live migration of VM
between distributed areas
With its real-time and active-active features, the storage appears to be just a simple "shared storage".
Live migration is also possible between DR sites
(it requires a common subnet and a fat pipe for the memory copy, of course).
after Migration
Copy to DR-sites
Front-end servers aggregate client requests (READ / WRITE) so that
many back-end servers can handle user data in a parallel and distributed manner.
Both performance and storage space are scalable, depending on the number of servers.
front-end
(access server)
Access Gateway
(via NFS, CIFS or similar)
clients
back-end
(core server)
WRITE req.
write
blocks
read blocks
READ req.
scalable performance &
scalable storage size
through parallel & distributed
processing technology
File
block block block
block block block
block block block
Meta
Data
consistent
hash
backend
(core servers)
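The picture on this slide (a file split into blocks, metadata, and a consistent hash that maps blocks to back-end core servers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the `HashRing` class, the SHA-1 based ring position, the virtual-node count, the block size, and the server names `cs1` to `cs4` are all invented for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


def split_blocks(data: bytes, block_size: int):
    """Cut a file's content into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers, vnodes=64):
        # Each server owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def lookup(self, block_id: str, copies: int = 1):
        """Return up to `copies` distinct servers, walking clockwise from the block's hash."""
        start = bisect.bisect(self._points, _hash(block_id))
        found = []
        for k in range(len(self._ring)):
            server = self._ring[(start + k) % len(self._ring)][1]
            if server not in found:
                found.append(server)
                if len(found) == copies:
                    break
        return found


if __name__ == "__main__":
    ring = HashRing(["cs1", "cs2", "cs3", "cs4"])
    for index, _ in enumerate(split_blocks(b"x" * 300, 128)):
        print(index, ring.lookup(f"vm.img:{index}", copies=3))
```

The property that matters for scaling out is that adding a core server only moves the blocks that the new server takes over; every other block keeps its placement.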
1. Assign a new unique ID to every updated block (to ensure consistency).
2. Make replicas in the local site (for a quick ACK) and update the metadata.
3. Make replicas in the globally distributed environment (for the actual data copies).
back-end
(multi-sites)
a file, consisting of many blocks
multiplicity in multiple locations
makes each piece of user data
redundant locally at first,
with 3 distributed copies in the end.
(2) create 2 copies locally
for each piece of user data,
write the metadata,
and return ACK
(1)
(1')
(3-a)
(3-a)
(3-a) make a copy
in different location
right after ACK.
(3-b) remove one
of the 2 local blocks
in the future.
(3-b)
(1) assign a new unique ID
to every updated block, so that
the ID ensures consistency
Most important !
the key for "distributed replication"
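The write path described on this slide (a fresh ID per updated block, two local copies and a metadata update before the ACK, then wide-area copies and removal of one local copy) can be modelled as below. It is a toy in-memory sketch, not the actual system: the `Site` and `Volume` classes, the use of `uuid4` for block IDs, and the choice of exactly two remote sites are assumptions made for illustration.

```python
import uuid


class Site:
    """One storage site holding copies of blocks, keyed by block ID (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> list of copies held at this site

    def store(self, block_id, data, copies=1):
        self.blocks.setdefault(block_id, []).extend([data] * copies)

    def drop_one(self, block_id):
        self.blocks[block_id].pop()


class Volume:
    def __init__(self, local, remotes):
        self.local = local
        self.remotes = remotes
        self.meta = {}      # (path, block index) -> current block ID
        self.pending = []   # block IDs still waiting for wide-area replication

    def write_block(self, path, index, data):
        block_id = uuid.uuid4().hex           # (1) new unique ID for every update
        self.local.store(block_id, data, 2)   # (2) two copies in the local site
        self.meta[(path, index)] = block_id   # (2) metadata points at the new ID
        self.pending.append(block_id)
        return "ACK"                          # ACK is returned before any remote copy

    def replicate(self):
        """(3) Background step: copy to the other sites, then trim one local copy."""
        for block_id in self.pending:
            data = self.local.blocks[block_id][0]
            for site in self.remotes:         # (3-a) copy to different locations
                site.store(block_id, data)
            self.local.drop_one(block_id)     # (3-b) remove one of the 2 local copies
        self.pending.clear()
```

Because an update never overwrites a block in place but always receives a new ID, a reader that follows the metadata sees either the old block or the new one, never a half-replicated mix.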
NFS
CIFS
iSCSI
redundancy
= 3
r = 2
ACK
r = 1
r = 0
write
redundancy
= 3
ACK
r = 2
e = 0
r = 1
e = 0
r = 0
e = 1
r = -1
e = 2
external
10Gbps
VMs
core
servers
access server
(nfsd)
VM images
VM
image
chunks
virtualization
host
316 km
440 km
690 km
Hiroshima
Univ.
Kanazawa
Univ.
Hiroshima Univ. Kanazawa Univ.
NII
VMM: virtual machine monitor
CS: core servers
HS: hint servers
AS: access servers
AS AS
VMM VMM
CS CS CS CS CS CS HS HS
CS CS CS HS
L3VPN
L3VPN
L2VPN
L2VPN
L2VPN
L2VPN
L3VPN
EXAGE-LAN
EXAGE-LAN
admin
LAN
admin
LAN
MIGRATION-LAN
EXAGE-LAN
MIGRATION-LAN
iozone -aceI
a: full automatic mode
c: Include close() in the timing calculations
e: Include flush (fsync,fflush) in the timing calculations
I: Use DIRECT_IO if possible for all file operations.
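For scripting the benchmark, the combined flag string `-aceI` can be written out flag by flag. The sketch below only builds the argument list and prints it; the `IOZONE_FLAGS` table and the helper name `iozone_command` are illustrative, and the descriptions are the ones given on the slide.

```python
import shlex

# Flag meanings as listed on the slide.
IOZONE_FLAGS = {
    "-a": "full automatic mode",
    "-c": "include close() in the timing calculations",
    "-e": "include flush (fsync, fflush) in the timing calculations",
    "-I": "use DIRECT_IO if possible for all file operations",
}


def iozone_command():
    """Return the argv list equivalent to `iozone -aceI`."""
    return ["iozone", *IOZONE_FLAGS]


if __name__ == "__main__":
    print(shlex.join(iozone_command()))  # iozone -a -c -e -I
    for flag, meaning in IOZONE_FLAGS.items():
        print(f"  {flag}: {meaning}")
```

Including `close()` and the flush in the timing, together with direct I/O, keeps the client page cache from hiding the cost of the storage back end.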
[Figure: iozone throughput (0 to 120 MB/sec) by file size (from 64 KB) and record size (4 KB to 16384 KB); panels: write, re-write, read, re-read, random read, random write, backwards read, record rewrite, strided read, fwrite, frewrite]
[Figure: throughput (0 to 120 MB/sec) vs. file size (10MB, 100MB, 1GB, 10GB); one panel per test: write, rewrite, read, reread, random read, random write, bkwd read, record rewrite, stride read, fwrite, fread]
legend
Conventional Exage/Storage
Wide-area Exage/Storage
SINET4 Hiroshima University EXAGE L3VPN
SINET4 Kanazawa University EXAGE L3VPN
core
servers
KVM host
access
server
distcloud
NFS server
access
server
Kanazawa
Univ.
Hiroshima
Univ.
proposed
method
(read)
NFS
(read)
decline of
throughput
by latency
start
live migration
                       
[Figure: throughput (MB/sec) around a live migration, proposed method vs. shared NFS; series: read (before migration), read (after migration), write (before migration), write (after migration)]
SC2013
2013/11/17∼22
@Colorado Convention Center
Ikuo Nakagawa
INTEC Inc. / Osaka University
Kohei Ichikawa
Nara Institute of Science and Technology
We have been developing a widely distributed cluster storage system and
evaluating the storage along with various applications. The main advantage of
our storage is its very fast random I/O performance, even though it provides a
POSIX-compatible file system interface on top of distributed cluster storage.
an initial plan
s
Shinji Shimojo
Director of JGN-X, NICT
It's not so
fun.
real stage
24,000 km
RTT=244ms
1Gbps
loop
back
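A short calculation shows why a loop-back path with RTT = 244 ms is demanding even at 1 Gbps. The sketch below computes the bandwidth-delay product for the figures on this slide, and the throughput ceiling of a single connection limited by a fixed window. The 64 KiB window is an assumption chosen for illustration; the slides do not state the transport settings that were used.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput(window_bytes: float, rtt_s: float) -> float:
    """Upper bound in bytes/sec when at most one window is sent per round trip."""
    return window_bytes / rtt_s


if __name__ == "__main__":
    rtt = 0.244          # RTT = 244 ms, from the slide
    link = 1e9           # 1 Gbps, from the slide
    window = 64 * 1024   # assumed window size (not from the slides)

    print(f"{bdp_bytes(link, rtt) / 1e6:.1f} MB in flight to fill the link")
    print(f"{window_limited_throughput(window, rtt) / 1e6:.2f} MB/s with a 64 KiB window")
```

About 30.5 MB must be outstanding to keep the link busy, so any protocol that waits for a round trip per small request falls far short of the line rate.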
real stage
Blocks (chunks)
are located
on the nearest
consistent
hash
Meta data
is not suitable
for wide area
required time for migration
type of line | load condition | required time (sec)
domestic | no load | 17.9
international | no load | 201.6
international | read load | 175.4
international | write load | 400.6

IO performance: average throughput (MB/s) of dd
type of line | access pattern | average throughput (MB/s)
domestic | read | 64.6
domestic | write | 58.7
international | read | 25.4
international | write | 20.9
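The measurements above are easier to compare as ratios. The following sketch restates the numbers from the two tables and derives the slowdown factors; the dictionary names and the helper `ratio` are illustrative only, and no values beyond those on the slide are introduced.

```python
# Required time for live migration, in seconds (values from the slide).
MIGRATION_SEC = {
    ("domestic", "no load"): 17.9,
    ("international", "no load"): 201.6,
    ("international", "read load"): 175.4,
    ("international", "write load"): 400.6,
}

# Average throughput of dd, in MB/s (values from the slide).
DD_MBPS = {
    ("domestic", "read"): 64.6,
    ("domestic", "write"): 58.7,
    ("international", "read"): 25.4,
    ("international", "write"): 20.9,
}


def ratio(a: float, b: float) -> float:
    return a / b


if __name__ == "__main__":
    base = MIGRATION_SEC[("international", "no load")]
    # Migration over the international line vs. the domestic line, no load.
    print(round(ratio(base, MIGRATION_SEC[("domestic", "no load")]), 1))            # 11.3
    # Effect of a concurrent write load on the international line.
    print(round(ratio(MIGRATION_SEC[("international", "write load")], base), 1))   # 2.0
    # dd throughput, domestic vs. international.
    print(round(ratio(DD_MBPS[("domestic", "read")], DD_MBPS[("international", "read")]), 1))    # 2.5
    print(round(ratio(DD_MBPS[("domestic", "write")], DD_MBPS[("international", "write")]), 1))  # 2.8
```

Migration time grows by roughly a factor of 11 on the international line, while steady-state dd throughput drops by only about 2.5 to 2.8 times, which suggests that migration is more sensitive to latency than bulk I/O is.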
Live migration
demo on an
international
line
Evaluations of
distcloud on
an international
line
Disaster
Recovery
demonstration
of DC down
U.S. region
will be built
soon
Future Works
SC14
2014/11/16∼21
@Ernest N. Morial Convention Center
Big Data
Analysis
behavior data
from
mobile devices
data from
non-electrified
areas
mobile
devices
sensor
devices
personal data
aggregation service
high latency
power consumption
mobile
devices
sensor
devices
low
latency
wide-area distributed
platform
regional
exchange
regional
exchange
personal data
aggregation service
route
optimization
the Internet
distcloud storage
region A region B region C
live migration
optimize routes while
preserving the independence of
each region
users from the Internet
can access the VM
after live migration
Layer | method | outline | features (○: advantage)
L3 | routing | update the routing table on each migration | ○ routing per region; drawbacks: cannot route per VM, routing operation cost
   | routing + L2 extension | VPLS, IEEE802.1ad PB (Q in Q), IEEE802.1ah (Mac-in-Mac) | ○ stability, operation cost; drawback: poor scalability
   | L2 over L3 | VXLAN, OTV, NVGRE | ○ stability; drawbacks: overhead of tunneling, IP multicast
   | SDN | OpenFlow | ○ programmable operation; drawback: cost of equipment
   | ID/locator separation | LISP | ○ scalability, routing per VM; drawbacks: cost, immediacy
   | IP mobility | MAT, NEMO, MIP (Kagemusha) | ○ scalability; drawback: load of router
L4 | mSCTP | SCTP multipath | ○ independent from L2/L3; drawback: limited to SCTP
L7 | DNS + reverse NAT | Dynamic DNS | ○ independent from L2/L3; drawbacks: altering IP addr., closing connection
2011.3.11 The aftermath of the 2011
Tohoku earthquake and tsunami
https://www.flickr.com/photos/idvsolutions/7439877658/sizes/o/in/photostream/
2014 Storage Developer Conference. © Osaka University All Rights Reserved.
go on to the
next stage
