An Active and Hybrid Storage System for Data-intensive Applications


Ph.D. Candidate: Zhiyang Ding
Defense Committee Members:
Dr. Xiao Qin
Dr. Kai H. Chang
Dr. David A. Umphress
University Reader:
Prof. Wei Wang,
Chair of the Art Design Dept.
5/7/2012

Cluster Computing
• Large-scale Data Processing is everywhere.


Motivation
• Traditional Storage Nodes on the Cluster
[Figure: traditional cluster architecture. Labels: Client, Internet, Head Node, Network switch, Compute Nodes, Storage Node (or Storage Area Network).]

Motivation
• What’s next?
• More “Active”.

[Figure: cluster with an active storage node. Labels: Client, Internet, Head Node, Network switch, Compute Nodes, Storage Node. Flows: Computation Offload, I/O Request, Raw Data, Pre-processed Data.]

About the Active Storage

• McSD: A Smart Disk Model
• pp-mpiBlast: How to deploy Active Storage?
• HcDD: Hybrid Disk for Active Storage

McSD:
A Multicore Active Storage Device

• I/O Wall Problem: CPU-I/O Gap
– Limited I/O Bandwidth
– CPU Waiting and Dissipating Power
• How to
– Bridge the CPU-I/O Gap
– Reduce I/O Traffic

Why McSD?

• “Active”:
– Leveraging the Processing Power of Storage Devices

• Benefits:
– Offloading Data-intensive Computation
– Reducing I/O Traffic
– Pipeline Parallel Programming


Contributions

• Design a prototype of a multicore active storage

• Design a pre-assembled processing module

• Extend a shared-memory MapReduce system

• Emulate the whole system on a real testbed


Background: Active Disks

• Traditional Smart/Active Disks
– On-board: Embedding a processor into the hard disk
– Various Research Models
• e.g., active disk, smart disk, IDISK, and SmartSTOR

• However, active disks have not been adopted by hardware vendors

[Figure labels: Improved attachment technologies; Cost of the System; I/O Bound Workloads; Reliability]

Background: Parallel Processing

• Multi-core Processors or Multi-processors
– A 45% increase in transistors yields a 20% increase in processing power
• MapReduce: a Parallel Programming Model
– MapReduce by Google
– Hadoop, Mars, Phoenix, etc.
• Multicore and Shared-memory Parallel Processing

Design: System Overview

[Figure: Design of an Active Storage. Components: Pipeline Parallel Processing; Communication Mechanism; Multicore and Shared-memory Parallel Processing; Hybrid Storage Disks.]

Design and Implementation

• Computation Mechanism
– Pre-assembled Processing Model
– smartFAM
• Extend the Shared-Memory MapReduce by Partitioning

Pre-assembled Processing Modules

• Pre-assembled Processing Modules
– Meet the nature of embedded services
– Reduce Complexity and Cost
– Provide Services
• E.g., multi-version antivirus service, pre-processing for data-intensive apps, de-duplication, etc.
• How to invoke services?


smartFAM

• smartFAM = Smart File Alteration Monitor
– Invokes the pre-assembled processing modules or functions by monitoring changes to the system log file.
• Two Components:
– inotify: a Linux kernel facility for file-system event notification
– a trigger daemon
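
The log-driven invocation described above can be sketched in a few lines. This is a minimal illustration, not the smartFAM implementation: it swaps inotify for plain polling so that it runs with the Python standard library alone, and the log line format (`module-name argument`), the function names, and the module table are all hypothetical.

```python
import time


def read_new_lines(log_path, offset):
    """Return (complete new lines, new offset) for data appended after offset."""
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
    # Consume only up to the last newline so a half-written line is left
    # for the next poll. rfind returns -1 when there is no newline yet.
    end = data.rfind(b"\n") + 1
    return data[:end].decode("utf-8").splitlines(), offset + end


def dispatch(lines, modules):
    """Invoke the module named at the start of each log line (assumed format)."""
    results = []
    for line in lines:
        name, _, arg = line.partition(" ")
        handler = modules.get(name)
        if handler is not None:  # unknown module names are ignored
            results.append(handler(arg))
    return results


def watch(log_path, modules, poll_seconds=1.0):
    """Trigger-daemon loop: poll the log and run modules for new entries."""
    offset = 0
    while True:
        lines, offset = read_new_lines(log_path, offset)
        dispatch(lines, modules)
        time.sleep(poll_seconds)
```

A real daemon would block on inotify events instead of sleeping, which avoids both the polling delay and the wasted wake-ups.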


Design and Implementation

[Figure: smartFAM architecture. Active Node: smartFAM daemon, inotify, pre-assembled modules (data-intensive functions, general functions). Host Node: main program, smartFAM daemon, inotify, Merge Results. Shared over NFS: module log, log files and result data. Steps are numbered 1 to 3.]

Extend the Phoenix:
A Shared-memory MapReduce Model

• Extend the Phoenix MapReduce Programming Model by partitioning and merging
– New API: partition_input
– New Functions:
• partition (provided by the new API)
• merge (developed by the user)

• Example:
– wordcount [data-file][partition-size][]
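
The partition-then-merge idea can be illustrated with a small word-count sketch. It is written in Python for brevity and is not the Phoenix API (Phoenix is a C library); `partition_text`, `word_count`, and `merge` are hypothetical names, and the partition size is counted in characters here purely for illustration.

```python
from collections import Counter


def partition_text(text, partition_size):
    """Split text into chunks of about partition_size characters.

    Each cut is pushed forward to the next whitespace so that no word
    is split across two partitions.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + partition_size, len(text))
        while end < len(text) and not text[end].isspace():
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks


def word_count(chunk):
    """Per-partition job: count the words in one chunk."""
    return Counter(chunk.split())


def merge(partials):
    """User-supplied merge: sum the per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


text = "a b a c b a"
counts = merge(word_count(chunk) for chunk in partition_text(text, 4))
```

Processing one partition at a time keeps each job's working set bounded, which is the motivation the deck gives for partitioning large inputs.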


Pipeline Processing


Evaluation Environment

• Testbed

• Benchmarks
– Word Count
– String Match
– Matrix Multiplication

• Individual Node Performance
• System Performance

Individual Node Performance

                 Word Count (seconds)      String Match (seconds)
                 1 GB       1.25 GB        1 GB       1.25 GB
w/ Partition     40.60      50.91          17.76      20.61
w/o Partition    85.74      139.54         17.62      21.00

System Evaluation

Matrix-Multiplication and Word-Count (Speedups)

Input Data Size   vs Single Machine   vs Single-core Active   vs McSD w/o Partition
500 MB            1.47 X              2.15 X                  0.99 X
750 MB            1.45 X              2.09 X                  1.04 X
1 GB              7.62 X              2.14 X                  6.07 X
1.25 GB           19.01 X             2.50 X                  15.39 X

$$\text{Speedup} = \frac{T_{\text{ConsumptionOfControlSample}}}{T_{\text{ConsumptionOfMcSD}}}$$

Summary

• McSD can improve system performance by offloading data-intensive computation

• McSD is a promising active storage model with
– Pre-assembled processing modules
– Parallel data processing
– Better performance in the evaluation


• McSD: A Smart Disk Model
• pp-mpiBlast
• HcDD

Apply Active Storages to a Cluster

• So far, we have seen the potential of active storage

• Challenge: How to coordinate active storage nodes with computing nodes?

• Propose a Pipeline-parallel Processing pattern

Contributions

• Propose a pipeline-parallel processing framework to “connect” an Active Storage node with computing nodes.
• Evaluate the framework using both an analytic model and a real implementation.
• Case Study: Extend an existing bioinformatics application based on the framework.

Background: Active Storage

[Figure: an active storage node. Labels: Processor, Memory, Mass Storage, Bridge?, Active Storage Node, SSD, SSD, Computation, Buffer Disks.]

Background: Bioinformatics App

• BLAST*: Basic Local Alignment Search Tool
– Compares primary biological sequence information

• mpiBLAST** is a freely available, open-source,
parallel implementation of NCBI BLAST.
– Format raw data files
– Run a parallel BLAST function
*http://blast.ncbi.nlm.nih.gov/
**http://www.mpiblast.org/

Pipeline-parallel Design

• Offload the raw-data formatting task to where the data is stored.
• Intra-application Pipeline-parallel Processing by “partition” and “merge”.
• pp-mpiBlast, a case study.

Pipelining Workflow

[Figure: pipelining workflow. Active Storage Node: the raw input file is split into partitions 1 to n, and FormatDB turns each partition into an intermediate file. Computing Nodes: mpiBlast turns intermediate files 1 to n into sub-outputs 1 to n, which are merged into the output file. Stages: Partition, FormatDB, mpiBlast, Merge.]
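
The workflow stages (Partition, FormatDB, mpiBlast, Merge) can be mimicked with a two-stage producer/consumer pipeline. The sketch below is only an illustration of the overlap between the two node types: `format_partition` and `search_partition` are hypothetical stand-ins that do trivial string work, not calls into FormatDB or mpiBLAST, and the two threads stand in for separate machines.

```python
import queue
import threading


def format_partition(part):
    """Stand-in for the formatting step on the active storage node."""
    return part.upper()


def search_partition(formatted):
    """Stand-in for the search step on the computing nodes."""
    return len(formatted)


def run_pipeline(partitions):
    """Format partition i+1 while partition i is being searched."""
    handoff = queue.Queue(maxsize=1)  # one intermediate file in flight
    sub_outputs = []

    def active_storage_stage():
        for part in partitions:
            handoff.put(format_partition(part))
        handoff.put(None)  # sentinel: no more partitions

    def compute_stage():
        while True:
            item = handoff.get()
            if item is None:
                break
            sub_outputs.append(search_partition(item))

    producer = threading.Thread(target=active_storage_stage)
    consumer = threading.Thread(target=compute_stage)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return sum(sub_outputs)  # merge step: combine the sub-outputs


total = run_pipeline(["ac", "gtt", "a"])
```

The bounded queue is a design choice for the sketch: it keeps the formatting stage at most one partition ahead, so the slower stage sets the pace of the pipeline.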

Analytic Model

• Three Critical Measures
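$$T_{\text{response}} = T_{\text{active}} + T_{\text{compute}}$$

$$\text{Throughput} = \frac{1}{\max(T_{\text{active}}, T_{\text{compute}})}$$

$$\text{Speedup} = \frac{T_{\text{sequence}}}{T_{\text{pipelined}}} = \frac{n \times (T_{\text{active}} + T_{\text{compute}})}{T_{\text{active}} + (n-1) \times \max(T_{\text{active}}, T_{\text{compute}}) + T_{\text{compute}}} = \frac{n}{1 + \dfrac{n-1}{\text{Throughput} \times T_{\text{response}}}}$$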
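
The three measures can be evaluated directly for a given pair of stage times. The function below is a small sketch that follows the definitions on this slide; the function name and the sample numbers in the usage lines are illustrative assumptions, not measurements from the evaluation.

```python
def pipeline_metrics(t_active, t_compute, n):
    """Return (response time, throughput, speedup) for n partitions.

    t_active  : time the active storage node spends on one partition
    t_compute : time the computing nodes spend on one partition
    """
    t_response = t_active + t_compute
    bottleneck = max(t_active, t_compute)
    throughput = 1.0 / bottleneck
    t_sequence = n * t_response
    # First partition fills the pipeline, the last one drains it, and the
    # n - 1 steps in between are paced by the slower stage.
    t_pipelined = t_active + (n - 1) * bottleneck + t_compute
    return t_response, throughput, t_sequence / t_pipelined


# Illustrative values only: 2 time units per partition on the active
# storage node, 3 on the computing nodes, 4 partitions.
response, throughput, speedup = pipeline_metrics(2.0, 3.0, 4)
```

With these sample values the sequential time is 20 units and the pipelined time is 14 units, so the speedup is 20/14. As n grows, the speedup approaches the response time divided by the bottleneck stage time.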

Evaluation Environment

Configuration   Computing Nodes      Active Storage Node
CPU             Intel Xeon X3430     Intel Core 2 Q9400
Memory          2 GB DDR3 (PC3-10600)
OS              Ubuntu 9.04 (Jaunty Jackalope), 32-bit version
Kernel          2.6.28-15-generic
Network         Gigabit LAN

Testbed                                   Nodes
Our testbed: “Pipeline-parallel”          12 Computing Nodes, 1 Active Storage Node
Comparison testbed: “12-node Cluster”     12 Computing Nodes, 1 Storage Node
Comparison testbed: “13-node Cluster”     13 Computing Nodes, 1 Storage Node

Pipeline-parallel Design

[Figure: Results compared with the 12-node system]

[Figure: Results compared with the 13-node system]

Speedups Trends: Partition Size


Summary

• We proposed a pipeline-parallel processing mechanism for integrating an Active Storage Node with computing nodes.

• As a case study, we extended a classic bioinformatics application to follow the pipeline-parallel style.


• McSD: A Smart Disk Model
• pp-mpiBlast
• HcDD

What’s Hybrid?

A hybrid combination of a gasoline engine and an electric motor, for efficiency

Hybrid Disk Drives

• A Hybrid Combination of Two Types of Storage
Devices: HDD and SSD
– HDD: Magnetic hard disk drive
– SSD: Solid state disk, built from NAND-based flash memory

What are their roles?


Motivation

• In a hybrid storage system, using SSDs as the buffer can boost the performance.
• However, SSDs suffer reliability issues.

WordCount on Intel Core2 Duo E8400 (seconds)

                     Input Data Size
Storage   Buffer     500 MB    750 MB    1 GB      1.25 GB
HDD       HDD        21.51     38.30     505.25    1294.64
HDD       SSD        19.89     36.41     85.74     139.54

Limitations Related to SSDs

• Flash Memory:
– Each block consists of 32, 64, or 128 pages.
– Each page is typically 512, 2,048, or 4,096 bytes.
• “Erase-before-write” at block level.
• Lifespan is limited to 10,000 Program/Erase cycles.
– E.g., *an 80 GB MLC SSD lasts only 106 days if the write rate is 30 MB/s.
• Rethink their roles?
*Based on the SSD lifespan calculator provided by Virident.com
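
For intuition about how write rate and Program/Erase cycles bound lifespan, the simplest back-of-the-envelope estimate is sketched below. It is an idealized upper bound that assumes perfect wear leveling, no write amplification, and decimal units (1 GB = 10^9 bytes); these are assumptions of the sketch, not parameters stated on the slide. The 106-day figure above comes from the cited calculator, whose inputs are not given here, and this sketch does not attempt to reproduce it.

```python
SECONDS_PER_DAY = 86_400


def ideal_lifespan_days(capacity_bytes, pe_cycles, write_bytes_per_s):
    """Upper bound on lifespan: every cell absorbs exactly pe_cycles writes."""
    total_writable_bytes = capacity_bytes * pe_cycles
    return total_writable_bytes / write_bytes_per_s / SECONDS_PER_DAY


# 80 GB drive, 10,000 P/E cycles, sustained 30 MB/s of writes.
days = ideal_lifespan_days(80e9, 10_000, 30e6)
```

Any real drive falls below this bound, because erase-before-write at block level forces more flash to be rewritten than the host actually sends.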

Contributions

• Hybrid Combination of HDD and SSD disks

• De-duplication Service using HDDs as a Write Buffer

• Internal-parallel Processing in SSD

• Simulation of the Whole System For Evaluation


Hybrid Disk Configuration

[Figure: hybrid disk configuration. Labels: I/O Requests, Data of Write Requests, De-duplication, HDD, Dedicated Processor, Deduplicated data, SSD, Read Requests, Pre-processing, Pre-processed Data.]

HcDD Architecture


Deduplication Design


Internal Parallel Processing

[Figure: an SDRAM cache holding eight request lists (List #0 to List #7; e.g., Req 17 to Req 24) above eight packages (Package #0 to Package #7).]

Evaluation


Internal Parallelism Evaluation:
Single Node


Single Node: Dedup Ratio


System Performance Evaluation


System Performance Evaluation


Summary


Conclusion

• McSD: A Smart Disk Model
• pp-mpiBlast
• HcDD

Future Work


Many Thanks!
And Questions?

