Site-Wide Storage Use Case and Early User Experience with Infinite Memory Engine
Tommy Minyard 
Texas Advanced Computing Center 
DDN User Group Meeting 
November 17, 2014
TACC Mission & Strategy 
The mission of the Texas Advanced Computing Center is to enable 
scientific discovery and enhance society through the application of 
advanced computing technologies. 
To accomplish this mission, TACC: 
– Evaluates, acquires & operates 
advanced computing systems 
– Provides training, consulting, and 
documentation to users 
– Collaborates with researchers to 
apply advanced computing techniques 
– Conducts research & development to 
produce new computational technologies 
[Diagram: Resources & Services | Research & Development]
TACC Storage Needs 
• Cluster specific storage 
– High performance (tens to hundreds of GB/s of bandwidth) 
– Large capacity (~2TB per teraflop), purged frequently 
– Very scalable to thousands of clients 
• Center-wide persistent storage 
– Global filesystem available on all systems 
– Very large capacity, quota enabled 
– Moderate performance, very reliable, high availability 
• Permanent archival storage 
– Maximum capacity (tens of PBs) 
– Slow performance: tape-based offline storage with a spinning-disk cache
History of DDN at TACC 
• 2006 – Lonestar 3 with DDN S2A9500 
controllers and 120TB of disk 
• 2008 – Corral with DDN S2A9900 controller 
and 1.2PB of disk 
• 2010 – Lonestar 4 with DDN SFA10000 
controllers with 1.8PB of disk 
• 2011 – Corral upgrade with DDN SFA10000 
controllers and 5PB of disk
Global Filesystem Requirements 
• User requests for persistent storage available 
on all production systems 
– Corral limited to UT System users only 
• RFP issued for storage system capable of: 
– At least 20PB of usable storage 
– At least 100GB/s aggregate bandwidth 
– High availability and reliability 
• DDN proposal selected for project
Stockyard: Design and Setup 
• A Lustre 2.4.2-based global filesystem, designed to scale with future upgrades 
• Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12k couplets, corresponding to 5PB capacity and 25+ GB/s throughput per SU 
• Four SUs provide 25PB raw with >100GB/s 
• 16 initial LNET routers for external mounts
Scalable Unit (One server rack with 
two DDN SFA12k couplet racks)
Scalable Unit Hardware Details 
• SFA12k Rack: 50U rack with 8x L6-30p 
• SFA12k couplet with 16 IB FDR ports (direct 
attachment to the 16 OSS servers) 
• 84-slot SS8460 drive enclosures (10 per rack, 20 enclosures per SU) 
• 4TB 7200RPM NL-SAS drives
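
A back-of-the-envelope check of the figures on the last two slides; this is only a sketch, assuming 8+2 RAID6 OSTs (a detail the slides do not state) and decimal terabytes:

  # Rough per-SU capacity check (assumes 8+2 RAID6 per OST; 1TB = 10^12 bytes)
  drives_per_su=$((20 * 84))       # 20 enclosures x 84 slots = 1680 drives
  raw_tb=$((drives_per_su * 4))    # 4TB drives -> 6720 TB (~6.7PB) raw per SU
  usable_tb=$((168 * 8 * 4))       # 168 OSTs x 8 data drives x 4TB = 5376 TB (~5PB) per SU
  echo "per SU: ${raw_tb} TB raw, ${usable_tb} TB usable"
  echo "4 SUs:  $((4 * raw_tb)) TB raw, $((4 * usable_tb)) TB usable (~20PB)"

The usable figure lines up with the quoted 5PB per SU and 20PB total; the raw total comes out a little above the 25PB quoted earlier, presumably because spares and rounding are not accounted for here.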
Stockyard Logical Layout
Stockyard: Installation
Stockyard: Capabilities and Features 
• 20PB usable capacity with 100+ GB/s 
aggregate bandwidth 
• Client systems can either add their own LNET routers that connect to the Stockyard core IB switches, or connect to the built-in LNET routers over IB (FDR14) or TCP (10GigE); a client-side configuration sketch follows this slide 
• Automatic failover with Corosync and 
Pacemaker
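
As an illustration of the client-attach options above, routing to a remote Lustre fabric is typically a one-line LNet module option on the client nodes or routers; the network names and router NIDs below are placeholders, not Stockyard's actual addresses:

  # /etc/modprobe.d/lustre.conf on a client cluster (illustrative values only).
  # The client's local fabric is o2ib0; Stockyard's fabric (o2ib100) is reached
  # through LNET routers at the listed NIDs. A TCP-only client would instead use
  # networks="tcp0(eth0)" and a router reachable on tcp0.
  options lnet networks="o2ib0(ib0)" routes="o2ib100 10.10.1.[1-16]@o2ib0"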
Stockyard: Performance 
• Local storage testing surpassed 100GB/s 
• Initial bandwidth from Stampede compute clients using Lustre 2.1.6 and 16 routers: 65GB/s with 256 clients (IOR, POSIX, file-per-process, 8 tasks per node; an example invocation is sketched after this slide) 
• After upgrade of Stampede clients to Lustre 
2.5.2: 75GB/s 
• Added 8 LNET routers to connect Maverick 
visualization system: 38GB/s
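
The Stampede numbers above came from IOR in POSIX, file-per-process mode with 8 tasks per node; an invocation of that style is sketched here, with block size, transfer size and path as illustrative values rather than the exact parameters used:

  # Illustrative IOR run: POSIX API, file-per-process (-F), write then read.
  # The task layout (e.g. 256 nodes x 8 tasks/node) comes from the batch job;
  # ibrun is TACC's MPI launcher. -b is the per-task file size, -t the transfer size.
  ibrun ior -a POSIX -F -w -r -e -t 1m -b 8g -o /stockyard/tmp/ior_test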
Failover Testing 
• OSS failover test setup and results 
• Procedure: 
– Identify the OSTs for the test pair 
– Initiate write processes targeted to those OSTs, each about 67GB in size so that it does not finish before the failover 
– Interrupt one of the OSS servers with a shutdown issued via ipmitool 
– Record the individual write process outputs as well as the server- and client-side Lustre messages 
– Compare and confirm the recovery and operation of the failover pair with all OSTs (a scripted version of this procedure is sketched below) 
• All I/O completed within 2 minutes of the failover
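
A sketch of how a test like this can be driven from the command line; the host name, OST index, file size and credentials are placeholders, since the slides describe the procedure but not the exact commands:

  # 1. Pin a ~67GB write to one OST served by the failover pair under test
  #    (stripe count 1; starting OST index 42 is illustrative)
  lfs setstripe -c 1 -i 42 /stockyard/tmp/failover_test
  dd if=/dev/zero of=/stockyard/tmp/failover_test bs=1M count=68000 &

  # 2. While the write is in flight, power off one OSS of the pair via its BMC
  ipmitool -I lanplus -H oss-07-ipmi -U admin -P "$IPMI_PASSWORD" chassis power off

  # 3. Wait for the write to finish, then review the Lustre recovery messages
  #    on the client and the surviving server
  wait
  grep -i lustre /var/log/messages | tail -n 50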
Failover Testing (cont’d) 
• The MDS pair was tested similarly: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log below shows the recovery. 
– Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 
– Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message 
– Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646 
– Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)
Infinite Memory Engine Evaluation 
• As with most HPC filesystems, applications rarely sustain the full bandwidth capability of the filesystem 
• What is really needed is the capacity of many disk spindles plus the ability to absorb bursts of I/O activity 
• Stampede was used to evaluate IME at scale, with the old /work filesystem as backend storage
IME Evaluation Hardware 
• Old Stampede /work filesystem hardware 
– Eight storage servers, 64 drives each 
– Lustre 2.5.2 server version 
– Capable of 24GB/s peak performance 
– At ~50% of capacity from previous use 
• IME hardware configuration 
– Eight DDN IME servers fully populated with SSDs 
– Two FDR IB connections per server (16 FDR links in aggregate) 
– 80GB/s peak performance
Initial IME Evaluation 
• First testing showed a bottleneck, with write performance reaching only 40GB/s 
• The IB topology was identified as the culprit: 12 of the IME IB ports were connected to a single IB switch with only 8 uplinks to the core switches, capping aggregate bandwidth at roughly 8 FDR links' worth of throughput, consistent with the ~40GB/s observed 
• Redistributing the IME IB links across switches without oversubscription resolved the bottleneck 
• Performance increased to almost 80GB/s after moving the IB connections
HACC_IO @ TACC: Cosmology Kernel 

[Diagram: IME burst buffer at 80 GB/s vs. Lustre PFS at 17 GB/s]

Particles per Process   Num. Clients   IME Write (GB/s)   IME Read (GB/s)   PFS Write (GB/s)   PFS Read (GB/s)
34M                     128            62.8               63.7              2.2                9.8
34M                     256            68.9               71.2              4.6                6.5
34M                     512            73.2               71.4              9.1                7.5
34M                     1024           63.2               70.8              17.3               8.2

IME acceleration: 3.7x-28x (writes), 6.5x-11x (reads)
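
The acceleration range is just the per-row ratio of IME to PFS bandwidth; a small sketch that recomputes it from the table values above:

  # Recompute HACC_IO acceleration from the table
  # (fields: clients, IME write, IME read, PFS write, PFS read, all GB/s)
  printf '%s\n' \
    "128  62.8 63.7  2.2 9.8" \
    "256  68.9 71.2  4.6 6.5" \
    "512  73.2 71.4  9.1 7.5" \
    "1024 63.2 70.8 17.3 8.2" |
  while read clients iw ir pw pr; do
    awk -v c="$clients" -v iw="$iw" -v ir="$ir" -v pw="$pw" -v pr="$pr" \
        'BEGIN { printf "%5d clients: write %.1fx, read %.1fx\n", c, iw/pw, ir/pr }'
  done

This yields write accelerations from about 3.7x to 28x and read accelerations from about 6.5x to 11x, matching the ranges on the slide.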
S3D @ TACC: Turbulent Combustion Kernel 

[Diagram: IME burst buffer at 60.8 GB/s vs. Lustre PFS at 3.3 GB/s]

Processes   X      Y       Z     IME Write (GB/s)   PFS Write (GB/s)   Acceleration
16          1024   1024    128   8.2                1.2                6.8x
32          1024   2048    128   14.0               1.5                9.3x
64          1024   4096    128   22.3               1.5                14.9x
128         1024   8192    128   31.8               3.0                10.6x
256         1024   16384   128   44.7               2.6                17.2x
512         1024   32768   128   53.5               2.4                22.3x
1024        1024   65536   128   60.8               3.3                18.4x
MADBench @ TACC 

[Diagram: IME burst buffer at 70+ GB/s vs. Lustre PFS at 8.7 GB/s]

Phase   IME Read (GB/s)   IME Write (GB/s)   PFS Read (GB/s)   PFS Write (GB/s)
S       -                 71.9               -                 7.1
W       74.6              75.5               7.8               8.7
C       74.7              -                  11.9              -

IME acceleration: 6.2x-9.6x (reads), 8.7x-10.1x (writes)

Application configuration: NP = 3136, #Bins = 8, #pix = 265K
Summary 
• Storage capacity and performance needs are growing at an exponential rate 
• High-performance, reliable filesystems are critical for HPC productivity 
• The current best solution for cost, performance and scalability is a Lustre-based filesystem 
• Initial IME testing demonstrated scalability and capability on a large-scale system
