Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabhat Gupta

P.K.Gupta, Megh Computing
#HWCSAIS17

Agenda
• Using Spark Streaming for Real Time Analytics
• Why FPGA : Low Latency and High Throughput
– Inline Processing
– Offload Processing
• Challenges in Using FPGA accelerators
• Megh Platform
– Arka Runtime
– Sira AFUs
• Demo Applications
• Conclusion

Using Spark Streaming with ML / DL for Real Time Analytics
[Figure: streams from social media, operations, transportation, marketing, sensors, and web sources feed ETL, data processing, ML, and DL stages; the application produces queries, alerts, and analysis.]

Real Time vs. Batch Insights
[Figure: value of data to decision making (vertical axis) against time (horizontal axis: real time, secs, mins, hours, days, months). Curve label: Information Half-Life in Decision Making. Regions: Time Critical Decisions; Traditional "Batch" Business Intelligence. Stages: Predictive/Preventive, Actionable, Reactive, Historical.]

Real Time Insights
[Figure: use cases placed on a latency scale (< 1 us, 10s us, ms, 10s ms, 100s ms, seconds). Labels: Hard Real Time, Regular, Trading, Fraud Prevention, Edge Computing, Dashboard (Inference), Operational Insights.]

Real Time Analytics platform: using Heterogeneous CPU+FPGA computing
[Figure: social media, operations, transportation, marketing, sensors, and web sources feed data processing on a CPU+FPGA platform in batch mode and real time mode; the application produces queries, alerts, and analysis. Deployment targets: public cloud, private cloud, edge cloud.]

In-Line Stream Processing: using heterogeneous CPU+FPGA platform
[Figure: two worker nodes side by side. In the first, data arrives on the system NIC and the executor runs Filter #1, Filter #2, and MLLib tasks. In the second, data arrives on the FPGA NIC, the FPGA runs Filter #1 and Filter #2, and the executor runs only the MLLib tasks.]
The FPGA terminates the network connection and dynamically chains filters to provide pre-processed, low-latency DStreams to Spark apps transparently.
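
The idea of dynamically chaining filters can be modeled in software. The following is a minimal sketch, not the Megh implementation: the class FilterChainSketch and the two filters dropBlank and lowerCase are hypothetical names chosen for illustration, and each stage merely stands in for one hardware filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Software model only (hypothetical): each stage stands in for one hardware filter.
final class FilterChainSketch {
    private final List<UnaryOperator<List<String>>> stages = new ArrayList<>();

    // Append a stage at runtime; returns this so calls can be chained.
    FilterChainSketch add(UnaryOperator<List<String>> stage) {
        stages.add(stage);
        return this;
    }

    // Pass the records through every stage in insertion order.
    List<String> apply(List<String> records) {
        List<String> current = records;
        for (UnaryOperator<List<String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Hypothetical filter #1: trim whitespace and drop empty records.
    static UnaryOperator<List<String>> dropBlank() {
        return records -> {
            List<String> out = new ArrayList<>();
            for (String r : records) {
                String t = r.trim();
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
            return out;
        };
    }

    // Hypothetical filter #2: normalize case.
    static UnaryOperator<List<String>> lowerCase() {
        return records -> {
            List<String> out = new ArrayList<>();
            for (String r : records) {
                out.add(r.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        FilterChainSketch chain = new FilterChainSketch()
                .add(dropBlank())
                .add(lowerCase());
        // prints [spark, fpga]
        System.out.println(chain.apply(Arrays.asList(" Spark ", "", "FPGA")));
    }
}
```

Because the chain is just a list of stages, the set and order of filters can change at runtime without touching the code that consumes the output, which is the property the slide attributes to the FPGA pipeline.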

In-Line Stream Processing: FPGA Architecture
[Figure: inside the FPGA, data input enters a Packet Processing Engine, passes through a bank of Filters coordinated by a Sequencer, and reaches a Streaming Engine that emits RDDs.]

In-Line Stream Performance
[Figure: performance charts showing lower latency and higher throughput.]
Source: An FPGA Based Low Latency Network Processing for Spark Streaming, K. Nakamura et al., Proceedings of the 2016 IEEE International Conference on Big Data (Big Data 2016)

Off-load Processing (ML/DL): using heterogeneous CPU+FPGA platform
[Figure: two worker nodes side by side. In the first, the executor runs SQL and DLLib tasks on the CPU. In the second, the executor runs the SQL task on the CPU while DLLib runs on the FPGA.]
Accelerate ML/DL algorithms transparently by providing Spark bindings to FPGA implementations of ML/DL libraries.

Off-load ML/DL Processing: FPGA Architecture
Source: Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning?, The Next Platform, March 21, 2017

Off-load ML/DL Performance
[Figure: performance charts showing lower latency and higher throughput; chart callouts: 10X, 500 fps.]
Source: Accelerating Persistent Neural Networks at Datacenter Scale, Eric Chung et al., Hot Chips, 2017

Challenges in using FPGA
1. Programming FPGAs
2. Managing FPGAs in the data center
3. Integrating FPGAs into applications

Spark Streaming Architecture using CPU+FPGA platform
[Figure: a client application submits work to the Spark Driver (Spark Context) on the master node, which works with the Cluster Resource Manager. Each worker node runs an Executor with Tasks and an FPGA Runtime; the FPGA hosts a Shell and multiple AFUs.]

Megh Platform: abstracts the complexity of the FPGA
[Figure: layered stack. On the CPU: Application; Java / C++ library adaptors and other app frameworks; Arka Runtime; FPGA driver. On the FPGA: Sira Accelerator Function Units (AFUs) providing packet RX, streaming functions, ML / DL functions, and packet TX. Path 1 is in-line processing; path 2 is off-load processing. A legend distinguishes infrastructure components from Megh components.]
Application:
• Uses standard APIs
• And/or custom APIs
Arka Runtime:
• FPGA management
• SW fallback
• Exposes AFaaS
Sira Accelerators:
• Downloaded at runtime
• Bare metal or exposed to VMs via VMM

Virtualized Real Time Analytics Stack
[Figure: software stack diagram spanning VMs, JVMs, and JVM threads. Labels: Spark Driver/Task; Custom Package/Lib; ML Package/Lib; ML adapter; Megh Arka Java/Scala; Utilities (resource manager, scheduler, etc.); Arka JNI access; Arka Runtime; low-level FPGA access library; VMM with VFIO (or Windows equivalent) or PCIe passthrough; FPGA kernel driver; CPU; FPGAs with Shell and AFUs.]
Application:
• Uses standard APIs
• And/or custom APIs
Runtime:
• FPGA management
• SW fallback
• Exposes AFaaS
Accelerators:
• Downloaded at runtime
• Exposed to VMs via VMM

In-Line Processing: Smart rx/tx adaptor architecture
[Figure: on the CPU, the Spark DStream Adapter and Arka Runtime sit in user space and the FPGA Kernel Driver in kernel space. DMA (VirtIO) connects the CPU to the FPGA, where the Shell hosts the Packet Processor, Filters, and Streaming Processor. A legend distinguishes infrastructure components from Megh components.]
• Packet Processor: intercepts network packets destined for Spark
• Filters: perform data cleaning, resizing, and layout transforms (ETL operations)
• Streaming Processor: creates DStream packets for Spark
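
In Spark Streaming a DStream is a sequence of RDDs, one per batch interval, so producing DStream input amounts to grouping incoming records by interval. The slides do not describe how the Streaming Processor does this; the sketch below is only a software illustration under that assumption, and MicroBatchSketch, Record, and toBatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical software model of grouping records into batch intervals.
final class MicroBatchSketch {

    // One timestamped record as it might arrive from the network.
    static final class Record {
        final long timestampMs;
        final String payload;

        Record(long timestampMs, String payload) {
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    // Group payloads by batch interval; each key is the interval start in ms.
    static Map<Long, List<String>> toBatches(List<Record> records, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Record r : records) {
            long start = Math.floorDiv(r.timestampMs, intervalMs) * intervalMs;
            batches.computeIfAbsent(start, k -> new ArrayList<>()).add(r.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Record> in = Arrays.asList(
                new Record(0, "a"),
                new Record(999, "b"),
                new Record(1000, "c"),
                new Record(2500, "d"));
        // With a 1 second interval this prints {0=[a, b], 1000=[c], 2000=[d]}
        System.out.println(toBatches(in, 1000));
    }
}
```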

Inline sample implementation
    import java.util.regex.Pattern;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.StorageLevels;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public final class JavaSqlNetworkWordCount {
      private static final Pattern SPACE = Pattern.compile(" ");

      public static void main(String[] args) throws Exception {
        if (args.length < 2) {
          System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
          System.exit(1);
        }
        StreamingExamples.setStreamingLogLevels();
        // Create the context with a 1 second batch size
        SparkConf sparkConf = new SparkConf().setAppName("JavaSqlNetworkWordCount");
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
        // Create a JavaReceiverInputDStream on target ip:port and count the
        // words in input stream of \n delimited text (e.g. generated by 'nc')
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
            args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
        JavaDStream<String> words = lines.flatMap(Split2Words());
        // ..
      }
      // ..
    }
CPU IMPLEMENTATION
1. Sets up the DStream CPU adapter connected to the system NIC.
2. Configures IP/port on the CPU NIC.
3. etlLibCPU.jar (CPU implementation)
• split2Words()
• split2Sort()
• split2Count()
FPGA IMPLEMENTATION
1. Sets up the DStream FPGA adapter connected to the FPGA NIC.
2. Configures IP/port on the FPGA NIC.
3. etlLibCPU.jar (FPGA implementation)
• split2Words()
• split2Sort()
• split2Count()
The FPGA is set up to stream and filter data before passing it to Spark as a DStream object.
* Full implementation:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaSqlNetworkWordCount.java
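
The slide names three library functions but does not show their bodies. As a rough illustration of what a CPU version might do, here is a plain-Java sketch with no Spark dependency. The behavior of each function (whitespace splitting, lexicographic sort, per-word counts) is an assumption inferred from the names, and EtlLibCpuSketch is a hypothetical class, not the contents of etlLibCPU.jar.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical CPU-side stand-ins for the functions named on the slide.
final class EtlLibCpuSketch {

    // Split a line into words on runs of whitespace; blank input gives no words.
    static List<String> split2Words(String line) {
        List<String> words = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return words;
        }
        Collections.addAll(words, trimmed.split("\\s+"));
        return words;
    }

    // Split, then sort the words lexicographically.
    static List<String> split2Sort(String line) {
        List<String> words = split2Words(line);
        Collections.sort(words);
        return words;
    }

    // Split, then count occurrences of each word (sorted by word).
    static Map<String, Integer> split2Count(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : split2Words(line)) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(split2Count("to be or not to be"));
    }
}
```

An FPGA-backed library would keep these same method signatures and move the work to hardware, which is what lets the application code stay unchanged.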

Off-load Processing: Low latency off-load of ML/DL libraries
[Figure: on the CPU, the Spark DStream Adapter and Arka Runtime sit in user space and the FPGA Kernel Driver in kernel space. DMA (VirtIO) connects the CPU to the FPGA, where the Shell hosts the ML Libraries and DL Libraries and connects to the Inter-FPGA Network. A legend distinguishes infrastructure components from Megh components.]
• Machine Learning Libraries: optimized libraries for K-Means, SVM, etc.
• Deep Learning Libraries: optimized libraries for DNN-based inference engines
• Inter-FPGA Network: FPGA network for sharing FPGA resources for larger DNN topologies

Offload Sample Implementation
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;

    public class JavaKMeansExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaKMeansExample");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ..
        // Cluster the data into two classes using KMeans
        int numClusters = 2;
        int numIterations = 20;
        KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
        // ..
        double cost = clusters.computeCost(parsedData.rdd());
        System.out.println("Cost: " + cost);
        // Evaluate clustering by computing Within Set Sum of Squared Errors
        double WSSSE = clusters.computeCost(parsedData.rdd());
        System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
        // ..
        jsc.stop();
      }
    }
mlib.jar (CPU library implementation)
• KMeans.train()
• KMeansModel.computeCost()
mlibFPGA.jar (FPGA accelerated library implementation)
• KMeans.train()
• KMeansModel.computeCost()
The CPU and FPGA libraries share the same function signatures, so switching to the FPGA library accelerates the application transparently.
* Full implementation:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaKMeansExample.java
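
For reference, the quantity that computeCost reports is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster center. The sketch below computes that value on plain arrays so the arithmetic is visible; WssseSketch is a hypothetical helper, not Spark or Megh code, and it assumes every point and center has the same dimension.

```java
// Hypothetical plain-Java illustration of the k-means cost (WSSSE).
final class WssseSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Sum over points of the squared distance to the nearest center.
    static double wssse(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                best = Math.min(best, squaredDistance(p, c));
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {10, 0}};
        double[][] centers = {{0, 0}, {10, 0}};
        // Nearest-center squared distances are 0, 1, and 0, so this prints 1.0
        System.out.println(wssse(points, centers));
    }
}
```

The inner loop over centers is independent for every point, which is the kind of data-parallel work an accelerated library can take over behind an unchanged signature.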

numAdd Demo: Implementation details
numAdd is a slight variation of the popular WordCount sample in which the numbers in the input files are parsed and added up using Spark.
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public static void main(String[] args) throws Exception {
      System.out.println(" Java: NumAdd Spark Demo.\n");
      Long total = null;
      SparkConf sparkConf = new SparkConf().setAppName("NumAdd");
      JavaSparkContext ctx = new JavaSparkContext(sparkConf);
      // Read the input file as a single partition of text lines
      JavaRDD<String> lines = ctx.textFile(args[0], 1);
      // sumOneString parses one line and returns the sum of its numbers
      JavaRDD<Long> sums = lines.map(new sumOneString());
      total = sums.reduce((a, b) -> (a + b));
      System.out.println("Total is -> " + total);
      ctx.stop();
    }
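
The per-line parse-and-add step can be sketched on the CPU without Spark. This is an assumption-laden illustration: it supposes each line holds whitespace-separated integers, and NumAddCpuSketch with its total helper is a hypothetical name. The slides do not specify the input format.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical CPU version of the numAdd logic, without Spark.
final class NumAddCpuSketch {

    // Parse whitespace-separated integers in one line and add them up.
    static long sumOneString(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return 0L;
        }
        long sum = 0L;
        for (String token : trimmed.split("\\s+")) {
            sum += Long.parseLong(token);
        }
        return sum;
    }

    // Map every line to its sum, then reduce the partial sums to one total.
    static long total(List<String> lines) {
        return lines.stream()
                .map(NumAddCpuSketch::sumOneString)
                .reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        // 6 + 0 + 42, so this prints 48
        System.out.println(total(Arrays.asList("1 2 3", "", "40 2")));
    }
}
```

The map and reduce steps mirror lines.map(...) and sums.reduce(...) in the Spark version; only the map step is a candidate for offload.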

Accelerated Operation: sumOneString
    // Instantiate the AFU as a service. The factory lets multiple distinct
    // implementations co-exist and be selected dynamically: specifically,
    // an FPGA implementation and a CPU-based fallback implementation.
    AFU.Factory fpgaFactory = new AFU.Factory();
    AFU wc = fpgaFactory.createAFU("meghna");
    // Buffer-queue based model (a register interface is available but not shown).
    TransferBuffer inbuf = wc.getTransferBuffer(input1.length());
    wc.queueInputBuffer(inbuf);
    // Reuse buffer 1 for the output. AFU design ensures this is safe.
    wc.queueOutputBuffer(inbuf); // Arka permits it.
    wc.startFunction(); // Start operation: the real work starts here.
    // Wait for results in the output queue.
    TransferBuffer obuff = wc.waitOnOutputQueue();
    return obuff.getByteBuffer().asLongBuffer().get(0);
AFU-optimized Transfer Buffers allow for:
• Zero copy to hardware
• Efficient access from Java/Scala
• AFU-specific implementations
• May use direct byte buffers, SVM, Netty, Apache Arrow, etc.
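
The two ideas on this slide, selecting an implementation at runtime with a CPU fallback and exchanging data through queued buffers, can be mimicked in plain Java. The sketch below is not the Arka API: SumUnit, CpuSumUnit, and SumUnitFactorySketch are hypothetical names, the hardware path is simulated by a supplier that always fails, and only the java.nio calls (allocateDirect, asLongBuffer) are real library methods.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Supplier;

// Hypothetical stand-in for an accelerator function with buffer queues.
interface SumUnit {
    void queueInput(ByteBuffer in);
    void start();
    ByteBuffer waitForOutput();
}

// CPU fallback: sums whitespace-separated integers found in each input buffer.
final class CpuSumUnit implements SumUnit {
    private final Queue<ByteBuffer> inputs = new ArrayDeque<>();
    private final Queue<ByteBuffer> outputs = new ArrayDeque<>();

    public void queueInput(ByteBuffer in) {
        inputs.add(in);
    }

    public void start() {
        ByteBuffer in;
        while ((in = inputs.poll()) != null) {
            String text = StandardCharsets.UTF_8.decode(in.duplicate()).toString().trim();
            long sum = 0L;
            if (!text.isEmpty()) {
                for (String token : text.split("\\s+")) {
                    sum += Long.parseLong(token);
                }
            }
            // Write the result into a direct buffer, as a device transfer might.
            ByteBuffer out = ByteBuffer.allocateDirect(Long.BYTES);
            out.putLong(0, sum);
            outputs.add(out);
        }
    }

    public ByteBuffer waitForOutput() {
        return outputs.remove();
    }
}

final class SumUnitFactorySketch {

    // Try the hardware implementation first; fall back to the CPU on failure.
    static SumUnit create(Supplier<SumUnit> hardware) {
        try {
            return hardware.get();
        } catch (RuntimeException unavailable) {
            return new CpuSumUnit();
        }
    }

    static long sumOneString(String line) {
        // No FPGA exists in this sketch, so the supplier always throws.
        SumUnit unit = create(() -> {
            throw new IllegalStateException("no FPGA in this sketch");
        });
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        ByteBuffer in = ByteBuffer.allocateDirect(bytes.length);
        in.put(bytes);
        in.flip();
        unit.queueInput(in);
        unit.start();
        return unit.waitForOutput().asLongBuffer().get(0);
    }

    public static void main(String[] args) {
        // prints 6
        System.out.println(sumOneString("1 2 3"));
    }
}
```

The caller never learns which implementation it received, which is the point of hiding the choice behind a factory.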

Demo: NumAdd Offload Profiling
[Chart: NumAdd execution time in seconds (axis 0 to 200) against file size (1M, 2M, 4M), comparing FPGA Offload with Spark Streaming.]
* Executor/task on the worker node restricted to 1 thread

In Summary
• Megh CPU+FPGA platform optimized for Real Time Analytics
• Arka Runtime supports different streaming frameworks
• Sira AFUs deliver low latency and high throughput for inline and offload processing

Thank You
info@meghcomputing.com
