WekaIO: Making Machine Learning Compute Bound Again

WekaIO Confidential © 2017 All rights reserved.
Making Machine Learning
Compute Bound Again
Liran Zvibel, Co-founder and CEO
liran@weka.io
@liranzvibel

WekaIO Introduction
WekaIO Matrix is the fastest, most scalable parallel file system for AI and
technical compute workloads that ensures your applications never wait for data.
[Slide panels: Who We Are; The Storage DNA; The Partners; The Accolades]

We are using WekaIO technologies over InfiniBand to address the challenges of
data analytics at extreme scale in life sciences, particle physics, geosciences, and
other fields. That process is still ongoing but to date we’ve already achieved some
promising results.
Michael Norman, Director of San Diego Supercomputer Center at UCSD

Future-thinking companies like WekaIO complement our core
principle of accelerating research and discovery. The ability to run
more concurrent high performance genomic workloads will
significantly advance our time to discovery.
Nelson Kick, Manager of HPC Operations

“Where can you find another non-linear servo-mechanism weighing only
150 pounds and having great adaptability, that can be produced so
cheaply by completely unskilled labor?”
-- Albert Scott Crossfield, May 1954
https://quoteinvestigator.com/2016/02/01/computer/#note-12956-2

"within a generation ... the problem of creating
'artificial intelligence' will substantially be solved”
(1967)
Marvin Minsky (1927 – 2016)
Co-Founder of MIT’s AI lab in 1959
The Journey Continues
Artificial Intelligence Demystified
Source: www.AnalyticsVidhya.com

“Self driving vehicles will completely
revolutionize the way we get around in the
future, but also have the potential to literally
save 100k of lives…
Self driving vehicles let us think differently about
road safety… 94% of vehicle accidents are
caused by human error.
Why wouldn’t we do anything we can to bring
self driving vehicles technologies to our roads?”
-- Senator Gary Peters, November 2017
Making the Case for AI

Deep Learning Requires Data

Source: Andrew Ng, VP & Chief Scientist of Baidu; Co-Chairman and Co-Founder
of Coursera; and Adjunct Professor at Stanford University
Scale Drives Progress

What has changed recently?
o Deep Learning algorithms have been
known since the ’80s
o Problem was with the computational
power
– GPUs and other accelerators made deep
learning practical
o “GPUs are the most efficient way to
take a compute bound problem, and
make it IO bound”
– Addison Snell, Intersect360 Research
o We will show you how to make sure
your Machine Learning workloads are
not IO bound!

GPU Progress in One Year
NVIDIA P100 vs V100
[Chart: speedup of DGX-1 V100 over DGX-1 P100, by framework]
PyTorch: 2.2x
TensorFlow: 2.5x
Caffe2: 2.9x
Caffe: 3.0x
MXNet: 3.05x

Deep Learning Requirements
o Actually very close to HPC problems…
o Store a vast amount of data
– Effectively “stage” working set back on fast storage, for efficient
access
o High bandwidth, low latency
o Very good metadata performance – traverse files quickly
o Very high single host performance
o Support multiprotocol (S3, HDFS, SMB, NFS)

Parallel FS over NVMe-oF
WekaIO Matrix Shared File System
[Diagram: applications running natively and in VMs (hypervisor: KVM, VMware) on the WekaIO Matrix shared file system]
Fully Coherent POSIX File System That Delivers Local File System Performance
Distributed Coding, More Resilient at Scale, Fast Rebuilds, End-to-End DP
Instantaneous Snapshots, Clones, Tiering to S3, Partial File Rehydration
InfiniBand or Ethernet, Hyperconverged or Dedicated Storage Server

Why parallel FS over NVMe-oF is needed
o Local copy architectures (e.g. Hadoop, or caching solutions) were developed when
1GbitE and HDDs were standard
o Modern networks on 100Gbit IB or Ethernet are 10x faster than SSD
o It is much easier to create distributed algorithms when locality is not important
o With the right networking stack, shared storage is faster than local storage
[Chart: Time it takes to complete a 4KB page move, in microseconds (axis 0 to 120); bars: 100Gbit (EDR), 10Gbit (SDR), SSD Write, SSD Read]
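
The network side of the chart above can be approximated with simple arithmetic. The sketch below computes only the time a 4KB page spends on the wire at the nominal line rate. It is an illustration under stated assumptions: the function name is hypothetical, and encoding overhead, protocol headers, switch hops, and software latency are all ignored, so real end-to-end numbers will be higher.

```python
PAGE_BYTES = 4096  # 4KB page, as in the chart title


def wire_time_us(payload_bytes, link_gbit_per_s):
    """Serialization time in microseconds at the nominal line rate.

    Assumption: only time on the wire is counted (no headers, encoding
    overhead, switch latency, or software stack cost).
    """
    bits = payload_bytes * 8
    return bits / (link_gbit_per_s * 1e9) * 1e6


for name, gbit in [("100Gbit (EDR)", 100), ("10Gbit (SDR)", 10)]:
    print(f"{name}: {wire_time_us(PAGE_BYTES, gbit):.2f} us")
# 100Gbit (EDR): 0.33 us
# 10Gbit (SDR): 3.28 us
```

Even at 10Gbit the wire time for one page is a few microseconds, which is why the network stack, not the link itself, tends to decide whether shared storage can compete with a local SSD.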

WekaIO Storage Solution for GPU Clusters
[Diagram: GPU servers, each running applications and the Matrix client, connect over an Ethernet or InfiniBand network to storage servers running MatrixFS. MatrixFS and an S3 data lake form a unified namespace.]

WekaIO Can Even Run in the GPU Cluster
[Diagram: GPU servers with local SSDs run the applications, connected by an Ethernet or InfiniBand network. The cluster and an S3 data lake form a unified namespace.]

AI Requires an Appropriate IA
[Diagram: pipeline stages INSTRUMENT, INGEST, CLEAN, TAG, EXPLORE, TRAIN, VALIDATE, ARCHIVE & CATALOG, all on WekaIO. Access protocols: S3, SMB, NFS, HDFS interface. Compute: CPU-based and GPU-based server cluster. Storage: growing data archive.]

Aggregate low latency operations – high throughput
[root@us2 imagenet]# time find . | wc -l
14092659
real 3m15.444s
user 0m8.333s
sys 0m4.397s
[root@us2 imagenet]# time ls|wc -l
21851
real 0m0.643s
user 0m0.084s
sys 0m0.005s
[root@us2 imagenet]# time ~/bin/fd . | wc -l
14092658
real 0m51.199s
user 1m44.484s
sys 0m46.463s
[https://github.com/sharkdp/fd]
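
In the `fd` run above, `user` time exceeds `real` time, which suggests the traversal was spread across several threads. The sketch below shows the same idea in Python: a level-by-level directory walk where each directory listing runs on a thread pool. The function name and the default worker count are assumptions for illustration, not anything from the deck or from `fd` itself. Like `fd`, it does not count the starting directory, which is consistent with the one-entry difference from `find` in the transcript.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_entries(root, workers=8):
    """Count every file and directory below root (root itself excluded)."""

    def scan(path):
        # List one directory; return its entry count and its subdirectories.
        subdirs, n = [], 0
        try:
            with os.scandir(path) as it:
                for entry in it:
                    n += 1
                    if entry.is_dir(follow_symlinks=False):
                        subdirs.append(entry.path)
        except OSError:
            pass  # unreadable directory: skip it
        return n, subdirs

    total = 0
    pending = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            # Scan all directories of the current depth concurrently.
            results = list(pool.map(scan, pending))
            pending = []
            for n, subdirs in results:
                total += n
                pending.extend(subdirs)
    return total
```

Calling `count_entries(".")` from the dataset root would give a figure comparable to `fd . | wc -l`. On a file system with low metadata latency, more concurrent listings translate directly into a shorter wall time.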

Aggregate low latency operations – high throughput
o Higher performance than a local file system over NVMe devices
– caching is useless!
– Much lower latency for 1MB files, similar latency even for 4k IOs
o Fill line (100GbE/EDR) with reads of small files (can fill line with 4k IO)
o Cluster scales as needed, can support 100s of GPU servers working
concurrently
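
To put "fill the line with 4k IO" in numbers, the sketch below works out how many IOs per second a 100Gbit link can carry, and, via Little's law, how many IOs must be in flight to sustain that rate. The function names are hypothetical, protocol overhead is ignored, and the 150 microsecond latency is the 4k figure quoted later in this deck, used here only as an example input.

```python
def iops_to_fill_line(link_gbit_per_s, io_bytes):
    """IOs per second needed to saturate the link (protocol overhead ignored)."""
    bytes_per_s = link_gbit_per_s * 1e9 / 8
    return bytes_per_s / io_bytes


def outstanding_ios(iops, latency_us):
    """Little's law: concurrency = throughput x latency."""
    return iops * latency_us / 1e6


iops_4k = iops_to_fill_line(100, 4096)
print(f"4k IOPS to fill 100Gbit: {iops_4k:,.0f}")
print(f"IOs in flight at 150 us: {outstanding_ios(iops_4k, 150):.0f}")
print(f"1MB reads/s to fill 100Gbit: {iops_to_fill_line(100, 1024 * 1024):,.0f}")
```

Roughly three million 4k IOs per second are needed to fill the link, against about twelve thousand 1MB reads, which is why small-file throughput depends on both low latency and high concurrency.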

WekaIO is over 2x Faster than Local Disk
Actual measured data at an autonomous vehicle training installation
o Problem: Could not achieve the bandwidth
required to keep GPU cluster saturated
o Pain Point: Wasted cycle time ($$$$) on
very expensive GPU clusters.
o Test Platform: 10 Node HPE DL360 vs.
local disk and leading All-Flash NAS
o Result:
– WekaIO – over 2x faster than Local Disk
– WekaIO – 6x faster than All-flash NAS
[Chart: 1MB Read Performance (Single GPU Client), in MB/sec (axis 0 to 8,000); bars: WekaIO, Local Disk SSD, All Flash NAS]

Comparing WekaIO and GPFS over NVMe
[Chart: SPEC 2014 Results – Comparing NetApp FAS and IBM Spectrum Scale 5.0; y-axis: latency in milliseconds (0 to 3); x-axis: 60 to 1200 in steps of 60; series: IBM Spectrum Scale, NetApp, WekaIO]
WekaIO does 2x the workload of IBM Spectrum Scale on standard whitebox servers

Tiering
o HDD integration provides optimal
economic model
o Parallel FS algorithms over the HDDs
similar to traditional solutions
o Large files are chopped down into small
objects to achieve parallelism
– Can rehydrate and modify partial files
o Stage next training set to NVMe while
current training run works – make sure
GPUs are always utilized
o Data demoted to object storage remains
on SSD as cache until space is required
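
The "chopped into small objects" point above is what makes partial rehydration possible: a read of a byte range only needs the objects that overlap it. A minimal sketch of that mapping follows. The 4 MiB object size and the function name are assumptions for illustration; the deck does not state the actual object size or layout.

```python
OBJECT_BYTES = 4 * 1024 * 1024  # hypothetical object size, not from the deck


def objects_for_range(offset, length, object_bytes=OBJECT_BYTES):
    """Return the indices of the objects covering bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // object_bytes
    last = (offset + length - 1) // object_bytes
    return range(first, last + 1)


# A 1 MiB read at the 10 GiB mark of a large file touches a single object,
# so only that object has to be fetched from the object store.
print(list(objects_for_range(10 * 1024**3, 1024**2)))
```

Because the objects are independent, the ones covering a large range can also be fetched in parallel.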

Fully Distributed Snapshots
o 4K granularity
o Instantaneous and no impact on performance
o Supports clones (writable snapshots)
o Supports file system wide or file-based snaps
o Redirect-on-write based snaps
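
Redirect-on-write means an overwrite never touches the old block: new data goes to a fresh location and only the live block map is updated, so a snapshot that still points at the old location stays valid. The toy model below illustrates the idea with 4K blocks. It is a sketch, not WekaIO's implementation: the class and its methods are hypothetical, and a real system would share map metadata rather than copy the whole map at snapshot time.

```python
class RowVolume:
    """Toy redirect-on-write volume with 4K logical blocks."""

    BLOCK = 4096

    def __init__(self):
        self._store = []   # append-only list of physical blocks
        self._live = {}    # logical block number -> index into _store
        self._snaps = {}   # snapshot name -> frozen block map

    def write(self, lbn, data):
        if len(data) != self.BLOCK:
            raise ValueError("writes must be exactly one 4K block")
        # Redirect: place the data in a new physical block and repoint
        # the live map; the previous physical block is left untouched.
        self._store.append(bytes(data))
        self._live[lbn] = len(self._store) - 1

    def snapshot(self, name):
        # Simplification: copy the map. No data blocks are copied.
        self._snaps[name] = dict(self._live)

    def read(self, lbn, snapshot=None):
        block_map = self._live if snapshot is None else self._snaps[snapshot]
        return self._store[block_map[lbn]]


vol = RowVolume()
vol.write(0, b"a" * 4096)
vol.snapshot("before")
vol.write(0, b"b" * 4096)
print(vol.read(0)[:1], vol.read(0, snapshot="before")[:1])  # b'b' b'a'
```

Since taking a snapshot copies no data, its cost is independent of how much data the file system holds.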

Snap-to-obj DR/Backup/burst Functionality
The ability to coherently save a complete snapshot to the object storage
Follow-on snapshots are saved differentially
Enabled use cases:
• Pause/resume: allows a cluster to be shut down when not needed,
supporting cloud elasticity
• Backup: the original cluster is not needed in order to access the data
• DR: enabled via geo-replicated object storage, or tiering to the cloud
• Cloud bursting: launch a cluster in the cloud based on a snap saved
on the object storage
The original cluster can now shut off and the data is safe
The rehydrated cluster may be of a different size
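
Saving a follow-on snapshot differentially means uploading only what changed since the previous snapshot. One way to picture this, assuming each snapshot is represented as a map from logical block number to a content reference (a representation chosen here for illustration, not described in the deck), is a simple comparison of the two maps:

```python
def differential(prev, curr):
    """Return (changed, deleted) between two snapshot block maps.

    Assumption: a snapshot is a dict of logical block number -> content
    reference. `changed` holds blocks that are new or point at different
    content; `deleted` lists blocks present only in the previous snapshot.
    """
    changed = {lbn: ref for lbn, ref in curr.items() if prev.get(lbn) != ref}
    deleted = sorted(lbn for lbn in prev if lbn not in curr)
    return changed, deleted


snap1 = {0: "obj-a", 1: "obj-b", 2: "obj-c"}
snap2 = {0: "obj-a", 1: "obj-d", 3: "obj-e"}
print(differential(snap1, snap2))  # ({1: 'obj-d', 3: 'obj-e'}, [2])
```

Only the entries in `changed` would need to be written to the object store; unchanged blocks are already there from the earlier snapshot.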

Snapshot File System for Infrastructure Elasticity
[Diagram: a Matrix cluster on-prem and a Matrix cluster in the Amazon cloud, each presenting the Matrix Global Namespace]

Cloud Bursting
o EC2 cluster can be formed based on S3 snap data
o The original on-premises cluster can continue running and
take snaps
o The EC2 cluster can run concurrently and take snaps
o Each snap that is pushed to S3 can be linked back to the
other system
o The data is viewed via the namespace using the .snapshots
directory and data can be merged
o A hybrid-cloud solution that is “sticky” to the public cloud.
Other solutions require network transfers that are transient
o The “data gravity” is in AWS rather than on-prem. A follow-up
burst is “cheaper”

Simple, Intuitive GUI

Simplifying Machine Learning Workflows
o Full line bandwidth to a single GPU
– GPUs are kept fully utilized
– Faster deep learning wall time
– Or same results can be obtained with fewer GPUs
o No need to copy data to local disk on GPU server
– Higher performance than local disk due to parallel access
– Faster time to first epoch
o Centralized storage for easy management
o Run in AWS for on-demand access to P3 GPUs
o Burst to AWS for elastic scaling of GPU resources
[Diagram: GPU servers running applications, connected over EDR; an on-premises cluster bursting to the AWS cloud]

Scalable High Performance Storage Infrastructure
Easy to manage
• Single namespace with easy
data sharing
• Simple intuitive GUI
• Installs in minutes
• Dynamic capacity scaling
• Easy performance scaling
• Multi-protocol access
- NFS/SMB/HDFS/REST(S3)
• Cloud analytics for easy
maintenance and control
• Policy driven auto. tiering
Flexible
• Supports Ethernet or
InfiniBand
• Public, private or hybrid
Cloud deployment
• Annual subscription with no
maintenance contracts
• Flexible configurations
- Hyperconverged
- Dedicated servers
• Data lake to lower cost
HDDs via S3, Swift or NFS
on-prem or public cloud
Feature Rich
• Highest Performance NVMe
- 150 usec 4k latency
- Up to 6GB/sec per client
- Lower latency than a local
FS over NVMe under load
• High Performance SATA
- Leverage existing infra
• Configurable resiliency N+2
to N+4
- Multiple concurrent failures
- scales to 1000s of servers
• Native-perf. 4k granular
snapshots and clones
• Cloud bursting and DR

WekaIO over IB – Best Storage for Deep Learning
[Diagram: pipeline stages INSTRUMENT, INGEST, CLEAN, TAG, EXPLORE, TRAIN, VALIDATE, ARCHIVE & CATALOG, all on WekaIO. Access protocols: S3, SMB, NFS, HDFS interface. Compute: CPU-based and GPU-based server cluster. Storage: growing data archive.]
Best Workflow support for AI workloads
• Natively supports InfiniBand – default interconnect for GPU
servers and workloads (DGX-1, HPE 6500)
• Highest single-client performance – saturates an EDR line with the
lowest-latency 4k IO. Low latency, high throughput
• Your GPU servers won’t idle anymore!
• Run GPUDirect traffic concurrently with storage traffic on a single network
• Scales to support any number of GPU servers sharing the
FS
• Single solution from ingest to archive. Native ability to
stage upcoming working set to SSD for next training job
• Parallel File System with NVMe-oF technology – IB
latency insignificant compared to SSD latency
• Shared file system with metadata performance of local FS
[Chart: latency in microseconds (axis 0 to 150); bars: 100Gbit (EDR), 10Gbit (SDR), SSD Write, SSD Read]

Software Architecture
o Runs inside an LXC container for
isolation
o SR-IOV to run network stack and
NVMe in user space
o Provides POSIX VFS through
lockless queues to WekaIO driver
o I/O stack bypasses kernel
o Scheduling and memory
management also bypass kernel
o Metadata split into many Buckets –
Buckets quickly migrate → no hot
spots
o Supports bare metal, container &
hypervisor
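
The bucket idea in the list above can be pictured as two small maps: a hash that places each path in one of many buckets, and a bucket-to-node assignment that can be changed by moving whole buckets. The sketch below is an illustration only. The bucket count, the hash choice, and all function names are assumptions, and the deck does not describe how WekaIO actually places or migrates buckets.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical; the deck only says "many"


def bucket_of(path, num_buckets=NUM_BUCKETS):
    """Map a file path to a metadata bucket with a stable hash."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def assign(num_buckets, nodes):
    """Initial round-robin bucket -> node assignment."""
    return {b: nodes[b % len(nodes)] for b in range(num_buckets)}


def rebalance(assignment, nodes):
    """Even out load by migrating buckets only off overloaded or departed nodes."""
    new = dict(assignment)
    load = {n: [] for n in nodes}
    movable = []
    for b, n in sorted(new.items()):
        if n in load:
            load[n].append(b)
        else:
            movable.append(b)  # owner is no longer in the cluster
    base, extra = divmod(len(new), len(nodes))
    # The currently busiest nodes keep the remainder, which minimizes moves.
    ordered = sorted(nodes, key=lambda n: -len(load[n]))
    quota = {n: base + (1 if i < extra else 0) for i, n in enumerate(ordered)}
    for n in nodes:
        while len(load[n]) > quota[n]:
            movable.append(load[n].pop())
    for n in nodes:
        while len(load[n]) < quota[n]:
            b = movable.pop()
            load[n].append(b)
            new[b] = n
    return new


before = assign(NUM_BUCKETS, ["a", "b", "c"])
after = rebalance(before, ["a", "b", "c", "d"])
moved = [b for b in before if before[b] != after[b]]
print(len(moved), "buckets migrated, all to", {after[b] for b in moved})
```

Because files map to buckets and not directly to nodes, adding a node moves a quarter of the buckets in this example while every path keeps its bucket number.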
[Diagram: software stack. User space: applications; Frontend; Backend; SSD Agent; Networking; Clustering, Balancing, Failure Detection & Communication, IP Takeover; distributed FS “Buckets”; Data Placement, FS Metadata, Tiering; NFS, S3, SMB, HDFS. Kernel: Weka driver, TCP/IP stack. Hardware below.]

Snapshot File System as DR Strategy
[Diagram: Matrix on-prem file system, Amazon Region A, Amazon Region B]
