Big Data Meets HPC – Exploiting HPC Technologies for Accelerating Big Data Processing

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Switzerland (Mar 2016)

Introduction to Big Data Applications and Analytics

•  Big Data has become one of the most important elements of business analytics
•  Provides groundbreaking opportunities for enterprise information management and decision making
•  The amount of data is exploding; companies are capturing and digitizing more information than ever
•  The rate of information growth appears to be exceeding Moore's Law

4V Characteristics of Big Data

•  Commonly accepted 3V's of Big Data
   •  Volume, Velocity, Variety
   Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
•  4/5V's of Big Data – 3V + *Veracity, *Value

Courtesy: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg

Data Generation in Internet Services and Applications

•  Webpages (content, graph)
•  Clicks (ad, page, social)
•  Users (OpenID, FB Connect, etc.)
•  e-mails (Hotmail, Y!Mail, Gmail, etc.)
•  Photos, Movies (Flickr, YouTube, Video, etc.)
•  Cookies / tracking info (see Ghostery)
•  Installed apps (Android Market, App Store, etc.)
•  Location (Latitude, Loopt, Foursquare, Google Now, etc.)
•  User generated content (Wikipedia & co, etc.)
•  Ads (display, text, DoubleClick, Yahoo, etc.)
•  Comments (Disqus, Facebook, etc.)
•  Reviews (Yelp, Y!Local, etc.)
•  Social connections (LinkedIn, Facebook, etc.)
•  Purchase decisions (Netflix, Amazon, etc.)
•  Instant Messages (YIM, Skype, Gtalk, etc.)
•  Search terms (Google, Bing, etc.)
•  News articles (BBC, NYTimes, Y!News, etc.)
•  Blog posts (Tumblr, Wordpress, etc.)
•  Microblogs (Twitter, Jaiku, Meme, etc.)
•  Link sharing (Facebook, Delicious, Buzz, etc.)

Number of Apps in the Apple App Store, Android Market, Blackberry, and Windows Phone (2013)
•  Android Market: ~1200K
•  Apple App Store: ~1000K

Courtesy: http://dazeinfo.com/2014/07/10/apple-inc-aapl-ios-google-inc-goog-android-growth-mobile-ecosystem-2014/

Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?

The global Internet population grew 18.5% from 2013 to 2015 and now represents 3.2 Billion People.

Courtesy: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/

Not Only in Internet Services – Big Data in Scientific Domains

•  Scientific Data Management, Analysis, and Visualization
•  Application examples
   –  Climate modeling
   –  Combustion
   –  Fusion
   –  Astrophysics
   –  Bioinformatics
•  Data Intensive Tasks
   –  Run large-scale simulations on supercomputers
   –  Dump data on parallel storage systems
   –  Collect experimental / observational data
   –  Move experimental / observational data to analysis sites
   –  Visual analytics – help understand data visually

Typical Solutions or Architectures for Big Data Analytics

•  Hadoop: http://hadoop.apache.org
   –  The most popular framework for Big Data analytics
   –  HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
•  Spark: http://spark-project.org
   –  Provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly
•  Storm: http://storm-project.net
   –  A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.
•  S4: http://incubator.apache.org/s4
   –  A distributed system for processing continuous unbounded streams of data
•  GraphLab: http://graphlab.org
   –  Consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API
•  Web 2.0: RDBMS + Memcached (http://memcached.org)
   –  Memcached: a high-performance, distributed memory object caching system

Big Data Processing with Hadoop Components

•  Major components included in this tutorial:
   –  MapReduce (Batch)
   –  HBase (Query)
   –  HDFS (Storage)
   –  RPC (Inter-process communication)
•  Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
•  Model scales, but the high amount of communication during intermediate phases can be further optimized

[Figure: Hadoop framework stack – User Applications over MapReduce and HBase, with HDFS and Hadoop Common (RPC) underneath]

Spark Architecture Overview

•  An in-memory data-processing framework
   –  Iterative machine learning jobs
   –  Interactive data analytics
   –  Scala based implementation
   –  Standalone, YARN, Mesos
•  Scalable and communication intensive
   –  Wide dependencies between Resilient Distributed Datasets (RDDs)
   –  MapReduce-like shuffle operations to repartition RDDs
   –  Sockets based communication

[Figure: Spark cluster – a Driver with SparkContext coordinating Workers through a Master, with Zookeeper and HDFS]

http://spark.apache.org
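
These two traits can be made concrete with a small job. The following is a minimal sketch against the Spark 1.x Java API, assuming a spark-submit deployment and hypothetical HDFS paths: an RDD is cached in memory and queried repeatedly, and a reduceByKey step introduces the wide dependency that triggers a MapReduce-like shuffle.

```java
// Minimal sketch of the two traits above: caching an RDD in memory for
// repeated queries, and a reduceByKey step whose wide dependency triggers a
// MapReduce-like shuffle across workers. Paths are illustrative assumptions.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class SparkShuffleSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("shuffle-sketch"); // master set by spark-submit
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load once, cache in memory, query repeatedly (iterative/interactive use).
    JavaRDD<String> lines = sc.textFile("hdfs:///data/input").cache();
    long total = lines.count();                               // first action materializes the cache
    long nonEmpty = lines.filter(s -> !s.isEmpty()).count();  // served from memory

    // reduceByKey repartitions the RDD by key: a wide dependency, hence a shuffle.
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(s -> Arrays.asList(s.split(" ")))            // Spark 1.x API (returns Iterable)
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);
    counts.saveAsTextFile("hdfs:///data/output");
    sc.stop();
  }
}
```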

Memcached Architecture

•  Distributed Caching Layer
   –  Allows aggregating spare memory from multiple nodes
   –  General purpose
•  Typically used to cache database queries, results of API calls
•  Scalable model, but typical usage very network intensive

[Figure: multiple nodes, each with main memory, CPUs, SSD, and HDD, connected through high performance networks]
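
To ground "cache database queries": the cache-aside pattern below is a minimal sketch using the spymemcached Java client; the server host, key scheme, TTL, and queryDatabase() helper are illustrative assumptions.

```java
// Cache-aside sketch of caching database queries: check Memcached first, fall
// back to the database on a miss, then populate the cache. Host, key scheme,
// TTL, and queryDatabase() are assumptions for illustration.
import net.spy.memcached.MemcachedClient;
import java.io.IOException;
import java.net.InetSocketAddress;

public class QueryCacheSketch {
  private final MemcachedClient mc;

  public QueryCacheSketch() throws IOException {
    mc = new MemcachedClient(new InetSocketAddress("cache-node", 11211));
  }

  public String lookup(String sql) {
    String key = "q:" + Integer.toHexString(sql.hashCode()); // assumed key scheme
    String cached = (String) mc.get(key);                    // one network round trip
    if (cached != null) return cached;                       // cache hit
    String result = queryDatabase(sql);                      // slow path: hit the DB
    mc.set(key, 300, result);                                // cache for 300 seconds
    return result;
  }

  private String queryDatabase(String sql) { /* hypothetical DB call */ return ""; }
}
```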

Data Management and Processing on Modern Clusters

•  Substantial impact on designing and utilizing data management and processing systems in multiple tiers
   –  Front-end data accessing and serving (Online)
      •  Memcached + DB (e.g. MySQL), HBase
   –  Back-end data analytics (Offline)
      •  HDFS, MapReduce, Spark

[Figure: Internet-facing front-end tier of web servers backed by Memcached + DB (MySQL) and NoSQL DB (HBase) for data accessing and serving; back-end tier of HDFS with MapReduce and Spark for data analytics apps/jobs]

Open Standard InfiniBand Networking Technology

•  Introduced in Oct 2000
•  High Performance Data Transfer
   –  Interprocessor communication and I/O
   –  Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100Gbps), and low CPU utilization (5-10%)
•  Multiple Operations
   –  Send/Recv
   –  RDMA Read/Write
   –  Atomic Operations (unique)
      •  high performance and scalable implementations of distributed locks, semaphores, collective communication operations
•  Leading to big changes in designing
   –  HPC clusters
   –  File systems
   –  Cloud computing systems
   –  Grid computing systems

Interconnects and Protocols in OpenFabrics Stack

[Figure: Application/Middleware sits on either the Sockets or the Verbs interface; each column below is one interconnect/protocol option in the OpenFabrics stack]

  Interface | Protocol                    | Adapter            | Switch
  ----------+-----------------------------+--------------------+------------------
  Sockets   | TCP/IP (kernel space)       | Ethernet Adapter   | Ethernet Switch    – 1/10/40/100 GigE
  Sockets   | TCP/IP w/ Hardware Offload  | Ethernet Adapter   | Ethernet Switch    – 10/40 GigE-TOE
  Sockets   | IPoIB (kernel space)        | InfiniBand Adapter | InfiniBand Switch  – IPoIB
  Sockets   | RSockets (user space)       | InfiniBand Adapter | InfiniBand Switch  – RSockets
  Sockets   | SDP (kernel space)          | InfiniBand Adapter | InfiniBand Switch  – SDP
  Verbs     | TCP/IP, RDMA (user space)   | iWARP Adapter      | Ethernet Switch    – iWARP
  Verbs     | RDMA (user space)           | RoCE Adapter       | Ethernet Switch    – RoCE
  Verbs     | RDMA (user space)           | InfiniBand Adapter | InfiniBand Switch  – IB Native

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

•  What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
•  Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
•  Can RDMA-enabled high-performance interconnects benefit Big Data processing?
•  Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
•  How much performance benefit can be achieved through enhanced designs?
•  How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?

Bring HPC and Big Data processing into a "convergent trajectory"!

Designing Communication and I/O Libraries for Big Data Systems: Challenges

[Figure: Applications and Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached) over programming models (Sockets – other protocols?) and a communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, benchmarks – upper level changes?), running on commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), and storage technologies (HDD, SSD, and NVMe-SSD)]

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

•  Sockets not designed for high-performance
   –  Stream semantics often mismatch for upper layers
   –  Zero-copy not available for non-blocking sockets

Current Design: Application -> Sockets -> 1/10/40/100 GigE Network
Our Approach: Application -> OSU Design -> Verbs Interface -> 10/40/100 GigE or InfiniBand

The High-Performance Big Data (HiBD) Project

•  RDMA for Apache Spark
•  RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
   –  Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
•  RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
•  RDMA for Memcached (RDMA-Memcached)
•  OSU HiBD-Benchmarks (OHB)
   –  HDFS and Memcached Micro-benchmarks
•  http://hibd.cse.ohio-state.edu
•  User base: 155 organizations from 20 countries
•  More than 15,500 downloads from the project site
•  RDMA for Apache HBase and Impala (upcoming)

Available for InfiniBand and RoCE

RDMA for Apache Hadoop 2.x Distribution

•  High-Performance Design of Hadoop over RDMA-enabled Interconnects
   –  High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
   –  Enhanced HDFS with in-memory and heterogeneous storage
   –  High performance design of MapReduce over Lustre
   –  Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
   –  Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
•  Current release: 0.9.9
   –  Based on Apache Hadoop 2.7.1
   –  Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0 and CDH 5.6.0 APIs and applications
   –  Tested with
      •  Mellanox InfiniBand adapters (DDR, QDR and FDR)
      •  RoCE support with Mellanox adapters
      •  Various multi-core platforms
      •  Different file systems with disks and SSDs and Lustre
   –  http://hibd.cse.ohio-state.edu

RDMA for Apache Spark Distribution

•  High-Performance Design of Spark over RDMA-enabled Interconnects
   –  High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
   –  RDMA-based data shuffle and SEDA-based shuffle architecture
   –  Non-blocking and chunk-based data transfer
   –  Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
•  Current release: 0.9.1
   –  Based on Apache Spark 1.5.1
   –  Tested with
      •  Mellanox InfiniBand adapters (DDR, QDR and FDR)
      •  RoCE support with Mellanox adapters
      •  Various multi-core platforms
      •  RAM disks, SSDs, and HDD
   –  http://hibd.cse.ohio-state.edu

RDMA for Memcached Distribution

•  High-Performance Design of Memcached over RDMA-enabled Interconnects
   –  High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and libMemcached components
   –  High performance design of SSD-Assisted Hybrid Memory
   –  Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
•  Current release: 0.9.4
   –  Based on Memcached 1.4.24 and libMemcached 1.0.18
   –  Compliant with libMemcached APIs and applications
   –  Tested with
      •  Mellanox InfiniBand adapters (DDR, QDR and FDR)
      •  RoCE support with Mellanox adapters
      •  Various multi-core platforms
      •  SSD
   –  http://hibd.cse.ohio-state.edu

OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached

•  Micro-benchmarks for Hadoop Distributed File System (HDFS)
   –  Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark
   –  Support benchmarking of
      •  Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS
•  Micro-benchmarks for Memcached
   –  Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark
•  Current release: 0.8
•  http://hibd.cse.ohio-state.edu
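
A Get-latency micro-benchmark of the kind listed above reduces to timing individual request-response round trips. The sketch below shows that measurement pattern with the spymemcached Java client; OHB itself is a separate implementation, and the payload size, iteration counts, and key name here are assumptions.

```java
// Measurement pattern for a Get-latency micro-benchmark: populate, warm up,
// then time each get() round trip and report the average. The client library,
// iteration counts, and key naming are illustrative, not the OHB code.
import net.spy.memcached.MemcachedClient;
import java.net.InetSocketAddress;

public class GetLatencySketch {
  public static void main(String[] args) throws Exception {
    MemcachedClient mc = new MemcachedClient(new InetSocketAddress("server", 11211));
    byte[] value = new byte[4096];                    // 4 KB payload
    mc.set("bench", 0, value).get();                  // populate and wait for the set

    for (int i = 0; i < 1000; i++) mc.get("bench");   // warm-up iterations

    int iters = 100000;
    long start = System.nanoTime();
    for (int i = 0; i < iters; i++) mc.get("bench");  // timed round trips
    double avgUs = (System.nanoTime() - start) / (iters * 1000.0);
    System.out.printf("avg get latency: %.2f us%n", avgUs);
    mc.shutdown();
  }
}
```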

Different Modes of RDMA for Apache Hadoop 2.x

•  HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
•  HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
•  HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
•  MapReduce over Lustre, with/without local disks: Besides HDFS based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
•  Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Design Overview of HDFS with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java based HDFS with communication library written in native code
•  Design Features
   –  RDMA-based HDFS write
   –  RDMA-based HDFS replication
   –  Parallel replication support
   –  On-demand connection setup
   –  InfiniBand/RoCE support

[Figure: Applications over HDFS; the write path goes through the Java Native Interface (JNI) to the OSU verbs-based design on RDMA capable networks (IB, iWARP, RoCE ..), while other operations use the Java socket interface over 1/10/40/100 GigE or IPoIB networks]

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
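
Because the package is API-compliant (see the release notes earlier), applications keep writing through the stock HDFS interface and the RDMA path is taken underneath the JNI layer. A minimal sketch of such a write, with an assumed path and payload:

```java
// A standard HDFS write through the public FileSystem API. In the
// RDMA-enhanced design described above, the same call path is redirected
// below the JNI layer to the verbs-based library; no application change is
// needed. The path and payload here are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/block.dat");
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      byte[] payload = "hello hdfs".getBytes(StandardCharsets.UTF_8);
      out.write(payload);                       // pipelined write + replication
      out.hsync();                              // flush to the DataNode pipeline
    }
    fs.close();
  }
}
```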

Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)

•  Design Features
   –  Three modes
      •  Default (HHH)
      •  In-Memory (HHH-M)
      •  Lustre-Integrated (HHH-L)
   –  Policies to efficiently utilize the heterogeneous storage devices
      •  RAM, SSD, HDD, Lustre
   –  Eviction/Promotion based on data usage pattern
   –  Hybrid Replication
   –  Lustre-Integrated mode:
      •  Lustre-based fault-tolerance

[Figure: Applications over Triple-H – heterogeneous storage with hybrid replication, data placement policies, and eviction/promotion across RAM Disk, SSD, HDD, and Lustre]

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
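
The eviction/promotion policy above can be pictured as a tier-selection step consulted on each block placement. The sketch below is a hypothetical illustration of that idea only; the tier names, thresholds, and access-count heuristic are assumptions, not the Triple-H implementation.

```java
// Hypothetical illustration of usage-driven tiering: place hot blocks high
// (RAM disk), demote cold ones toward HDD/Lustre. Thresholds, tier names,
// and the access-count heuristic are assumptions for illustration.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TieringSketch {
  enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

  private final Map<String, Integer> accessCount = new ConcurrentHashMap<>();

  // Pick a tier for a block based on how often it has been read recently.
  Tier placementFor(String blockId) {
    int hits = accessCount.getOrDefault(blockId, 0);
    if (hits > 100) return Tier.RAM_DISK;   // hot: keep in memory
    if (hits > 10)  return Tier.SSD;        // warm: fast persistent device
    return Tier.HDD;                        // cold: capacity tier (Lustre in HHH-L)
  }

  // Record a read; promotion happens the next time placementFor() is consulted.
  void recordAccess(String blockId) {
    accessCount.merge(blockId, 1, Integer::sum);
  }
}
```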

Design Overview of MapReduce with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java based MapReduce with communication library written in native code
•  Design Features
   –  RDMA-based shuffle
   –  Prefetching and caching map output
   –  Efficient Shuffle Algorithms
   –  In-memory merge
   –  On-demand Shuffle Adjustment
   –  Advanced overlapping
      •  map, shuffle, and merge
      •  shuffle, merge, and reduce
   –  On-demand connection setup
   –  InfiniBand/RoCE support

[Figure: Applications over MapReduce (Job Tracker, Task Tracker, Map, Reduce); the shuffle path goes through the Java Native Interface (JNI) to the OSU verbs-based design on RDMA capable networks (IB, iWARP, RoCE ..), while other traffic uses the Java socket interface over 1/10/40/100 GigE or IPoIB networks]

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
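
As with HDFS, the RDMA-based shuffle sits beneath the standard MapReduce API, so a shuffle-heavy job runs unchanged. The classic word count below (input/output paths assumed) is the kind of job whose map-to-reduce data movement the design above accelerates.

```java
// Standard shuffle-heavy MapReduce job (word count). Under the design above,
// the map-output shuffle between these two phases moves over RDMA; the
// application code itself is unchanged. Input/output paths are assumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    protected void map(LongWritable k, Text v, Context ctx)
        throws java.io.IOException, InterruptedException {
      for (String w : v.toString().split("\\s+"))
        ctx.write(new Text(w), ONE);            // emitted pairs are what get shuffled
    }
  }
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text k, Iterable<IntWritable> vals, Context ctx)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(k, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount-sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/in"));
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```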

Advanced Overlapping among Different Phases

•  A hybrid approach to achieve maximum possible overlapping in MapReduce across all phases compared to other approaches
   –  Efficient Shuffle Algorithms
   –  Dynamic and Efficient Switching
   –  On-demand Shuffle Adjustment

[Figure: timelines contrasting the default architecture, enhanced overlapping with in-memory merge, and advanced hybrid overlapping]

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

Design Overview of Hadoop RPC with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java based RPC with communication library written in native code
•  Design Features
   –  JVM-bypassed buffer management
   –  RDMA or send/recv based adaptive communication
   –  Intelligent buffer allocation and adjustment for serialization
   –  On-demand connection setup
   –  InfiniBand/RoCE support

[Figure: Applications over Hadoop RPC; the default path uses the Java socket interface over 1/10/40/100 GigE or IPoIB networks, while our design goes through the Java Native Interface (JNI) to the OSU verbs-based design on RDMA capable networks (IB, iWARP, RoCE ..)]

X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013.

Performance Benefits – RandomWriter & TeraGen in TACC-Stampede

[Figure: execution time (s) vs. data size (80-120 GB) for RandomWriter and TeraGen, comparing IPoIB (FDR) against the OSU design; RandomWriter time reduced by 3x, TeraGen by 4x]

Cluster with 32 Nodes with a total of 128 maps
•  RandomWriter
   –  3-4x improvement over IPoIB for 80-120 GB file size
•  TeraGen
   –  4-5x improvement over IPoIB for 80-120 GB file size

Performance Benefits – Sort & TeraSort in TACC-Stampede

[Figure: execution time (s) vs. data size (80-120 GB) for Sort (cluster with 32 nodes, 128 maps and 64 reduces) and TeraSort (cluster with 32 nodes, 128 maps and 57 reduces), comparing IPoIB (FDR) and OSU-IB (FDR); Sort time reduced by 52%, TeraSort by 44%]

•  Sort with single HDD per node
   –  40-52% improvement over IPoIB for 80-120 GB data
•  TeraSort with single HDD per node
   –  42-44% improvement over IPoIB for 80-120 GB data

Evaluation of HHH and HHH-L with Applications

[Figure: MR-MSPolyGraph execution time (s) vs. concurrent maps per host (4, 6, 8) for HDFS, Lustre, and HHH-L, reduced by 79%; CloudBurst execution time of 60.24 s with HDFS (FDR) vs. 48.3 s with HHH (FDR)]

•  MR-MSPolygraph on OSU RI with 1,000 maps
   –  HHH-L reduces the execution time by 79% over Lustre, 30% over HDFS
•  CloudBurst on TACC Stampede
   –  With HHH: 19% improvement over HDFS

Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)

[Figure: TeraGen and TeraSort execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for HDFS-IPoIB (QDR), Tachyon, and OSU-IB (QDR); TeraGen reduced by 2.4x, TeraSort by 25.2%]

•  For 200GB TeraGen on 32 nodes
   –  Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
   –  Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)

N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015

Design Overview of Shuffle Strategies for MapReduce over Lustre

•  Design Features
   –  Two shuffle approaches
      •  Lustre read based shuffle
      •  RDMA based shuffle
   –  Hybrid shuffle algorithm to take benefit from both shuffle approaches
   –  Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation (see the sketch below)
   –  In-memory merge and overlapping of different phases are kept similar to RDMA-enhanced MapReduce design

[Figure: map tasks write intermediate data to Lustre; reduce tasks obtain it via Lustre read or RDMA, followed by in-memory merge/sort and reduce]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
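
The per-request decision referenced above can be rendered as a small cost comparison: keep a running profile of Lustre read bandwidth and choose RDMA whenever the network path looks cheaper. The sketch below is a hypothetical illustration; the moving-average profiling and cost model are assumptions, not the released algorithm.

```java
// Hypothetical sketch of the hybrid per-request choice described above:
// profile Lustre reads, estimate the cost of serving the next shuffle request
// from Lustre, and fall back to RDMA when the network path looks cheaper.
// All names and the cost model are illustrative assumptions.
public class HybridShuffleSketch {
  enum Path { LUSTRE_READ, RDMA }

  private double lustreBytesPerSec = 1e9;   // running estimate, updated by profiling
  private final double rdmaBytesPerSec;     // measured once at startup (assumed)

  HybridShuffleSketch(double rdmaBytesPerSec) { this.rdmaBytesPerSec = rdmaBytesPerSec; }

  // Called after every profiled Lustre read: exponential moving average.
  void recordLustreRead(long bytes, long nanos) {
    double observed = bytes / (nanos / 1e9);
    lustreBytesPerSec = 0.8 * lustreBytesPerSec + 0.2 * observed;
  }

  // Choose the cheaper path for one shuffle request of the given size.
  Path choose(long requestBytes) {
    double lustreCost = requestBytes / lustreBytesPerSec;
    double rdmaCost = requestBytes / rdmaBytesPerSec;
    return (lustreCost <= rdmaCost) ? Path.LUSTRE_READ : Path.RDMA;
  }
}
```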

Performance Improvement of MapReduce over Lustre on TACC-Stampede

•  Local disk is used as the intermediate data directory
•  For 500GB Sort in 64 nodes
   –  44% improvement over IPoIB (FDR)
•  For 640GB Sort in 128 nodes
   –  48% improvement over IPoIB (FDR)

[Figure: job execution time (sec) vs. data size (300-500 GB), and vs. cluster/data scale (20 GB on 4 nodes up to 640 GB on 128 nodes), comparing IPoIB (FDR) and OSU-IB (FDR); reduced by 44% and 48% respectively]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.

Case Study – Performance Improvement of MapReduce over Lustre on SDSC-Gordon

•  Lustre is used as the intermediate data directory
•  For 80GB Sort in 8 nodes
   –  34% improvement over IPoIB (QDR)
•  For 120GB TeraSort in 16 nodes
   –  25% improvement over IPoIB (QDR)

[Figure: job execution time (sec) vs. data size for Sort (40-80 GB) and TeraSort (40-120 GB), comparing IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34% and 25% respectively]

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

HBase-RDMA Design Overview

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Java based HBase with communication library written in native code

[Figure: Applications over HBase; the OSU-IB design goes through the Java Native Interface (JNI) to IB verbs on RDMA capable networks (IB, iWARP, RoCE ..), while the default path uses the Java socket interface over 1/10/40/100 GigE or IPoIB networks]
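
The YCSB workloads on the next two slides exercise exactly this client-side Get/Put path. A minimal sketch with the standard HBase 1.x Java client follows; the table, column family, and row names are assumptions, and the RDMA-enhanced design leaves this public API untouched.

```java
// Minimal HBase client Get/Put of the kind YCSB issues in the read-write
// workload measured on the next slide. Table and column names are assumed;
// the RDMA-enhanced design leaves this public client API unchanged.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOpsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("usertable"))) {
      byte[] cf = Bytes.toBytes("f"), qual = Bytes.toBytes("q");

      Put put = new Put(Bytes.toBytes("user1"));             // one YCSB-style write
      put.addColumn(cf, qual, Bytes.toBytes("value-1"));
      table.put(put);

      Result r = table.get(new Get(Bytes.toBytes("user1"))); // one YCSB-style read
      System.out.println(Bytes.toString(r.getValue(cf, qual)));
    }
  }
}
```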

HBase – YCSB Read-Write Workload

[Figure: read and write latency (us) vs. number of clients (8-128) for IPoIB (QDR), OSU-IB (QDR), and 10GigE]

•  HBase Get latency (QDR, 10GigE)
   –  64 clients: 2.0 ms; 128 clients: 3.5 ms
   –  42% improvement over IPoIB for 128 clients
•  HBase Put latency (QDR, 10GigE)
   –  64 clients: 1.9 ms; 128 clients: 3.5 ms
   –  40% improvement over IPoIB for 128 clients

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12

HBase – YCSB Get Latency and Throughput on SDSC-Comet

[Figure: average Get latency (ms) and total throughput (Kops/sec) vs. number of client threads (1-4), comparing IPoIB (FDR) and OSU-IB (FDR); 59% latency reduction and 2.4x throughput gain]

•  HBase Get average latency (FDR)
   –  4 client threads: 38 us
   –  59% improvement over IPoIB for 4 client threads
•  HBase Get total throughput
   –  4 client threads: 102 Kops/sec
   –  2.4x improvement over IPoIB for 4 client threads

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Design Overview of Spark with RDMA

•  Enables high performance RDMA communication, while supporting traditional socket interface
•  JNI Layer bridges Scala based Spark with communication library written in native code
•  Design Features
   –  RDMA based shuffle
   –  SEDA-based plugins
   –  Dynamic connection management and sharing
   –  Non-blocking data transfer
   –  Off-JVM-heap buffer management
   –  InfiniBand/RoCE support

[Figure: Spark applications (Scala/Java/Python) and tasks over the BlockManager/BlockFetcherIterator; shuffle servers and fetchers come in Java NIO (default), Netty (optional), and RDMA (plug-in) variants, running over Java sockets on 1/10 Gig Ethernet/IPoIB (QDR/FDR) or over the RDMA-based shuffle engine (Java/JNI) on native InfiniBand (QDR/FDR)]

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014
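
The SortBy/GroupBy and HiBench results on the following slides stress this shuffle path. The sketch below (Spark 1.x Java API, generated data) produces the two shuffle patterns being measured; with the RDMA shuffle plug-in configured, the same job exercises the verbs path instead of NIO/Netty.

```java
// The two shuffle patterns measured on the following slides: sortByKey and
// groupByKey both force a repartition of the RDD, which is the traffic the
// RDMA shuffle plug-in accelerates. Data generation here is illustrative.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ShufflePatternsSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("shuffle-patterns"));
    Random rnd = new Random(42);
    List<Tuple2<Integer, Integer>> data = new ArrayList<>();
    for (int i = 0; i < 1_000_000; i++) data.add(new Tuple2<>(rnd.nextInt(1000), i));
    JavaPairRDD<Integer, Integer> pairs = sc.parallelizePairs(data, 64); // 64 partitions

    long sorted = pairs.sortByKey().count();    // range-partitioning shuffle (SortBy)
    long grouped = pairs.groupByKey().count();  // hash-partitioning shuffle (GroupBy)
    System.out.println(sorted + " " + grouped);
    sc.stop();
  }
}
```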

Performance Evaluation on SDSC Comet – SortBy/GroupBy

[Figure: total time (sec) vs. data size (64-256 GB) for SortByTest and GroupByTest on 64 worker nodes, 1536 cores, comparing IPoIB and RDMA; reduced by up to 80% and 57% respectively]

•  InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
•  RDMA-based design for Spark 1.5.1
•  RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
   –  SortBy: Total time reduced by up to 80% over IPoIB (56Gbps)
   –  GroupBy: Total time reduced by up to 57% over IPoIB (56Gbps)

Performance Evaluation on SDSC Comet – HiBench Sort/TeraSort

[Figure: total time (sec) vs. data size (64-256 GB) for Sort and TeraSort on 64 worker nodes, 1536 cores, comparing IPoIB and RDMA; reduced by 38% and 15% respectively]

•  InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
•  RDMA-based design for Spark 1.5.1
•  RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
   –  Sort: Total time reduced by 38% over IPoIB (56Gbps)
   –  TeraSort: Total time reduced by 15% over IPoIB (56Gbps)

Performance Evaluation on SDSC Comet – HiBench PageRank

[Figure: total time (sec) vs. data size (Huge, BigData, Gigantic) for PageRank on 32 worker nodes/768 cores and on 64 worker nodes/1536 cores, comparing IPoIB and RDMA; reduced by 37% and 43% respectively]

•  InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
•  RDMA-based design for Spark 1.5.1
•  RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
   –  32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
   –  64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Memcached-RDMA Design

•  Server and client perform a negotiation protocol
   –  Master thread assigns clients to appropriate worker thread
•  Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
•  All other Memcached data structures are shared among RDMA and Sockets worker threads
•  Memcached server can serve both socket and verbs clients simultaneously
•  Memcached applications need not be modified; uses verbs interface if available

[Figure: sockets and RDMA clients negotiate with the master thread (1), which hands them to sockets or verbs worker threads (2); all worker threads share the Memcached data structures (memory slabs, items, ...)]

RDMA-Memcached Performance (FDR Interconnect)

[Figure: Memcached GET latency (us) vs. message size (1 byte - 4 KB), and throughput (thousands of transactions per second) vs. number of clients (16-4080), comparing OSU-IB (FDR) and IPoIB (FDR); latency reduced by nearly 20x, throughput improved 2x]

•  Memcached Get latency
   –  4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us
   –  2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us
•  Memcached Throughput (4 bytes)
   –  4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s
   –  Nearly 2X improvement in throughput

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

Micro-benchmark Evaluation for OLDP Workloads

[Figure: latency (sec) and throughput (Kq/s) vs. number of clients (64-400), comparing Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps); latency reduced by 66%]

•  Illustration with Read-Cache-Read access pattern using modified mysqlslap load testing tool
•  Memcached-RDMA can
   –  improve query latency by up to 66% over IPoIB (32Gbps)
   –  improve throughput by up to 69% over IPoIB (32Gbps)

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS'15

Performance Benefits of Hybrid Memcached (Memory + SSD) on SDSC-Gordon

[Figure: average latency (us) vs. message size, and throughput (million trans/sec) vs. number of clients (64-1024), comparing IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); 2x throughput gain]

•  ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
•  Memcached-RDMA can
   –  improve query latency by up to 70% over IPoIB (32Gbps)
   –  improve throughput by up to 2X over IPoIB (32Gbps)
   –  No overhead in using hybrid mode when all data can fit in memory

Performance Evaluation on IB FDR + SATA/NVMe SSDs

•  Memcached latency test with Zipf distribution, server with 1 GB memory, 32 KB key-value pair size, total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
•  When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem
•  When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem

[Figure: Set/Get latency (us) broken down into slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss-penalty, for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, with data fitting and not fitting in memory]

Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs

•  Design Features
   –  RDMA-Accelerated Communication for Memcached Get/Set
   –  Hybrid 'RAM+SSD' slab management for higher data retention
   –  Non-blocking API extensions
      •  memcached_(iset/iget/bset/bget/test/wait)
      •  Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
      •  Ability to exploit communication/computation overlap
      •  Optional buffer re-use guarantees
   –  Adaptive slab manager with different I/O schemes for higher throughput

[Figure: client and server connected through the RDMA-enhanced communication library; the client's libMemcached library exposes blocking and non-blocking API flows, and the server runs the hybrid slab manager (RAM+SSD)]

D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016
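
The overlap these extensions enable can be illustrated with any future-based client. The sketch below uses the spymemcached Java client's asynchronous calls as a stand-in; the actual iset/iget/bset/bget extensions are C additions to libMemcached, so this is an analogy rather than their API.

```java
// Communication/computation overlap in the style of the non-blocking
// iset/iget extensions, illustrated with spymemcached's future-based calls
// (a Java stand-in; the actual extensions are C additions to libMemcached).
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.internal.GetFuture;
import net.spy.memcached.internal.OperationFuture;
import java.net.InetSocketAddress;

public class NonBlockingOverlapSketch {
  public static void main(String[] args) throws Exception {
    MemcachedClient mc = new MemcachedClient(new InetSocketAddress("server", 11211));

    OperationFuture<Boolean> setF = mc.set("k1", 0, "v1"); // issue set, don't wait
    GetFuture<Object> getF = mc.asyncGet("k2");            // issue get, don't wait

    doUsefulComputation();  // overlap: network + SSD I/O proceed in the background

    boolean stored = setF.get();    // analogous to memcached_wait/test
    Object value = getF.get();
    System.out.println(stored + " " + value);
    mc.shutdown();
  }

  private static void doUsefulComputation() { /* application work overlapped with I/O */ }
}
```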

Performance Evaluation with Non-Blocking Memcached API

[Figure: average Set/Get latency (us), broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory), for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b]

•  Data does not fit in memory: Non-blocking Memcached Set/Get API Extensions can achieve
   •  >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
   •  >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
•  Data fits in memory: Non-blocking Extensions perform similar to RDMA-Mem/RDMA-Hybrid and >3.6x improvement over IPoIB-Mem

H = Hybrid Memcached over SATA SSD; Opt = Adaptive slab manager; Block = Default Blocking API; NonB-i = Non-blocking iset/iget API; NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Accelerating I/O Performance of Big Data Analytics through RDMA-based Key-Value Store

•  Design Features
   –  Memcached-based burst-buffer system
      •  Hides latency of parallel file system access
      •  Read from local storage and Memcached
   –  Data locality achieved by writing data to local storage
   –  Different approaches of integration with parallel file system to guarantee fault-tolerance

[Figure: Map/Reduce tasks and the DataNode issue I/O through an I/O forwarding module, which writes to local disk for data locality and to the Memcached-based burst buffer system, with Lustre behind it for fault-tolerance]
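
Conceptually, the read path is a key-value lookup placed in front of the parallel file system. The following is a hypothetical sketch of that forwarding logic; the block naming, Memcached client, and Lustre path are all assumptions.

```java
// Hypothetical sketch of the burst-buffer read path described above: serve a
// block from the Memcached layer when present, otherwise fall back to the
// parallel file system and repopulate the buffer. Names/paths are assumed.
import net.spy.memcached.MemcachedClient;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BurstBufferReadSketch {
  private final MemcachedClient mc;

  public BurstBufferReadSketch() throws Exception {
    mc = new MemcachedClient(new InetSocketAddress("bb-node", 11211));
  }

  byte[] readBlock(String blockId) throws Exception {
    byte[] cached = (byte[]) mc.get(blockId);   // fast path: burst buffer hit
    if (cached != null) return cached;
    byte[] data = Files.readAllBytes(Paths.get("/lustre/blocks/" + blockId)); // slow path
    mc.set(blockId, 0, data);                   // repopulate to hide future PFS latency
    return data;
  }
}
```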

Evaluation with PUMA Workloads

[Figure: execution time (s) for the SeqCount, RankedInvIndex, and HistoRating workloads, comparing HDFS (32Gbps), Lustre (32Gbps), and Mem-bb (32Gbps); gains of 40%, 48.3%, and 17% respectively]

Gains on OSU RI with our approach (Mem-bb) on 24 nodes
•  SequenceCount: 34.5% over Lustre, 40% over HDFS
•  RankedInvertedIndex: 27.3% over Lustre, 48.3% over HDFS
•  HistogramRating: 17% over Lustre, 7% over HDFS

N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics with RDMA-based Key-Value Store, ICPP '15, September 2015

Acceleration Case Studies and Performance Evaluation

•  RDMA-based Designs and Performance Evaluation
   –  HDFS
   –  MapReduce
   –  RPC
   –  HBase
   –  Spark
   –  Memcached (Basic, Hybrid and Non-blocking APIs)
   –  HDFS + Memcached-based Burst Buffer
   –  OSU HiBD Benchmarks (OHB)

Are the Current Benchmarks Sufficient for Big Data?

•  The current benchmarks provide some performance behavior
•  However, they do not provide any information to the designer/developer on:
   –  What is happening at the lower-layer?
   –  Where the benefits are coming from?
   –  Which design is leading to benefits or bottlenecks?
   –  Which component in the design needs to be changed and what will be its impact?
   –  Can performance gain/loss at the lower-layer be correlated to the performance gain/loss observed at the upper layer?

Challenges in Benchmarking of RDMA-based Designs

[Figure: the same stack as before – applications and Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached) over programming models (Sockets, RDMA protocols – other protocols?) and the communication and I/O library, on commodity computing system architectures, networking technologies, and storage technologies; current benchmarks exist only at the middleware level, no benchmarks at the lower layers – can the two be correlated?]

OSU MPI Micro-Benchmarks (OMB) Suite

•  A comprehensive suite of benchmarks to
   –  Compare performance of different MPI libraries on various networks and systems
   –  Validate low-level functionalities
   –  Provide insights to the underlying MPI-level designs
•  Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
•  Extended later to
   –  MPI-2 one-sided
   –  Collectives
   –  GPU-aware data movement
   –  OpenSHMEM (point-to-point and collectives)
   –  UPC
•  Has become an industry standard
•  Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems
•  Available from http://mvapich.cse.ohio-state.edu/benchmarks
•  Available in an integrated manner with MVAPICH2 stack

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

[Figure: the same stack again, with applications-level benchmarks at the top and micro-benchmarks at the bottom feeding an iterative design process across the middleware, communication and I/O library, and hardware layers]

OSU HiBD Benchmarks (OHB)

•  HDFS Benchmarks
   –  Sequential Write Latency (SWL) Benchmark
   –  Sequential Read Latency (SRL) Benchmark
   –  Random Read Latency (RRL) Benchmark
   –  Sequential Write Throughput (SWT) Benchmark
   –  Sequential Read Throughput (SRT) Benchmark
•  Memcached Benchmarks
   –  Get Benchmark
   –  Set Benchmark
   –  Mixed Get/Set Benchmark
•  Available as a part of OHB 0.8
•  MapReduce and RPC benchmarks to be released

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)
X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013

On-going and Future Plans of OSU High Performance Big Data (HiBD) Project

•  Upcoming releases of RDMA-enhanced packages will support
   –  HBase
   –  Impala
•  Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
   –  MapReduce
   –  RPC
•  Advanced designs with upper-level changes and optimizations
   –  Memcached with Non-blocking API
   –  HDFS + Memcached-based Burst Buffer

Concluding Remarks

•  Discussed challenges in accelerating Hadoop, Spark and Memcached
•  Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, Spark, and Memcached
•  Results are promising
•  Many other open issues need to be solved
•  Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner

Funding Acknowledgments

Funding Support by [sponsor logos]

Equipment Support by [vendor logos]

Personnel Acknowledgments

Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Current Post-Docs: J. Lin, D. Banerjee
Current Programmer: J. Perkins
Current Research Specialist: M. Arnold

Past Research Scientist: S. Sur
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Past Programmer: D. Bureddy

Second International Workshop on High-Performance Big Data Computing (HPBDC)

HPBDC 2016 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois USA, May 27th, 2016

Keynote Talk: Dr. Chaitanya Baru, Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)

Panel Moderator: Jianfeng Zhan (ICT/CAS)
Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques

Six Regular Research Papers and Two Short Research Papers
http://web.cse.ohio-state.edu/~luxi/hpbdc2016

HPBDC 2015 was held in conjunction with ICDCS'15
http://web.cse.ohio-state.edu/~luxi/hpbdc2015

Thank You!
panda@cse.ohio-state.edu

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/
