P.K. Gupta, Megh Computing
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS
#HWCSAIS17
Agenda
• Using Spark Streaming for Real Time Analytics
• Why FPGA: Low Latency and High Throughput
– Inline Processing
– Offload Processing
• Challenges in Using FPGA accelerators
• Megh Platform
– Arka Runtime
– Sira AFUs
• Demo Applications
• Conclusion
Using Spark Streaming with ML / DL
for Real Time Analytics
[Diagram labels: Streams (Social Media, Operations, Transportation, Marketing, Sensors); Application (ETL, Data Processing, ML, DL); Web, Queries, Alerts, Analysis]
Real Time vs. Batch Insights
[Chart: Value of Data to Decision Making (y-axis) vs. Time (x-axis: Real Time, Secs, Mins, Hours, Days, Months). Curve label: Information Half-Life in Decision Making. Regions: Time Critical Decisions; Traditional “Batch” Business Intelligence. Value levels: Predictive/Preventive, Actionable, Reactive, Historical]
Real Time Insights
[Chart: latency scale (< 1 us, 10s us, ms, 10s ms, 100s ms, seconds) running from Hard Real Time to Regular. Example workloads: Trading, Fraud Prevention, Edge Computing, Dashboard (Inference), Operational Insights]
Real Time Analytics platform:
using Heterogeneous CPU+FPGA computing
[Diagram labels: Social Media, Operations, Transportation, Marketing, Sensors; Application; Data Processing on a CPU+FPGA Platform; Batch Mode, Real Time Mode; Web, Queries, Alerts, Analysis; Public Cloud, Private Cloud, Edge Cloud]
In-Line Stream Processing:
using heterogeneous CPU+FPGA platform
[Diagram labels: Worker Node with System NIC and an Executor running Filter #1, Filter #2, and MLLib tasks; Worker Node with an FPGA (FPGA NIC, Filter #1, Filter #2) and an Executor running the MLLib task]
The FPGA terminates the network connection and dynamically chains filters, transparently providing pre-processed, low-latency DStreams to Spark apps.
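The idea of chaining filters in front of the application can be illustrated in software. The following is a minimal sketch in plain Java, not the FPGA implementation: the `Filter` interface, the `chain` helper, and the two example filters are hypothetical names introduced here purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

final class FilterChainSketch {
    /** Hypothetical filter: transforms a record, or drops it by returning empty. */
    interface Filter extends Function<String, Optional<String>> {}

    /** Chains filters in order; a record dropped by one filter skips the rest. */
    static Filter chain(List<Filter> filters) {
        return record -> {
            Optional<String> current = Optional.of(record);
            for (Filter f : filters) {
                if (!current.isPresent()) {
                    break;
                }
                current = f.apply(current.get());
            }
            return current;
        };
    }

    /** Runs two example filters (trim, then drop empty records) over the input. */
    static List<String> run(List<String> input) {
        Filter trim = r -> Optional.of(r.trim());
        Filter dropEmpty = r -> r.isEmpty() ? Optional.empty() : Optional.of(r);
        Filter pipeline = chain(Arrays.asList(trim, dropEmpty));
        return input.stream()
                .map(pipeline)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }
}
```

Because the chain is itself a `Filter`, the downstream consumer sees only cleaned records and does not need to know how many filters ran, which mirrors the transparency the slide describes.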
In-Line Stream Processing:
FPGA Architecture
[Diagram (FPGA): Data input, Packet Processing Engine, Filters, Sequencer, Streaming Engine, RDDs]
In-Line Stream Performance
[Charts: Lower Latency; Higher Throughput]
Source: K. Nakamura et al., "An FPGA Based Low Latency Network Processing for Spark Streaming," Proceedings of the 2016 IEEE International Conference on Big Data (Big Data 2016)
Off-load Processing (ML/DL):
using heterogeneous CPU+FPGA platform
[Diagram labels: Worker Node with an Executor running SQL and DLLib tasks; Worker Node with an Executor running a SQL task and DLLib on the FPGA]
Accelerates ML/DL algorithms transparently by providing Spark bindings to FPGA implementations of ML/DL libraries.
Off-load ML/DL Processing:
FPGA Architecture
Source: "Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning?", The Next Platform, March 21, 2017
Off-load ML/DL Performance
Source: Eric Chung et al., "Accelerating Persistent Neural Networks at Datacenter Scale," Hot Chips, 2017
[Charts: Lower Latency; Higher Throughput. Callouts: 10X, 500 fps]
Challenges in using FPGA
1. Programming FPGAs
2. Managing FPGAs in the data center
3. Integrating FPGAs into applications
Spark Streaming Architecture
using CPU+FPGA platform
[Diagram labels: Client Application; Master Node with Spark Driver (Spark Context); Cluster Resource Manager; Worker Nodes, each with an Executor running Tasks, an FPGA Runtime, and an FPGA (Shell, AFUs)]
Megh Platform:
abstracts the complexity of the FPGA
[Diagram labels: Application; Java / C++ Library Adaptors; Other App Frameworks; Arka Runtime; FPGA Driver; CPU; FPGA with Sira Accelerator Function Units (AFU): Packet RX, Streaming Functions, ML / DL Functions, Packet TX. Path 1: In-line Processing. Path 2: Off-load Processing. Legend: Infrastructure Components, Megh Components]
Application:
• Uses standard APIs
• And/or custom APIs
Arka Runtime:
• FPGA management
• SW fallback
• Exposes AFaaS
Sira Accelerators:
• Downloaded at runtime
• Bare metal or exposed to VMs via VMM
Virtualized Real Time Analytics Stack
[Diagram labels: VMs, JVMs, JVM Threads; Spark Driver/Task; Custom Package/Lib; ML Package/Lib; ML adapter; Megh Arka Java/Scala; Arka JNI Access; Utilities (Resource Manager, Scheduler, etc.); Arka Runtime; Low Level FPGA Access Lib; VMM with VFIO (or Windows equivalent) or PCIe passthrough; FPGA Kernel Driver; CPU; FPGAs with Shell and AFUs]
Application:
• Uses standard APIs
• And/or custom APIs
Runtime:
• FPGA management
• SW fallback
• Exposes AFaaS
Accelerators:
• Downloaded at runtime
• Exposed to VMs via VMM
In-Line Processing:
Smart rx/tx adaptor architecture
[Diagram labels: CPU (User Space, Kernel Space): Spark DStream Adapter, Arka Runtime, FPGA Kernel Driver; DMA (VirtIO); FPGA: Shell, Packet Processor, Filters, Streaming Processor. Legend: Infrastructure Components, Megh Components]
• Packet Processor: intercepts network packets destined for Spark
• Filters: perform data cleaning, resizing, and layout transforms (ETL operations)
• Streaming Processor: creates DStream packets for Spark
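Spark Streaming treats a DStream as a sequence of small batches, one per batch interval. As a rough software illustration of what grouping incoming records into such batches could look like, here is a sketch in plain Java. It does not describe the Streaming Processor's hardware design: the `Record` class and the `batch` method are hypothetical names, and grouping by arrival timestamp is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class MicroBatchSketch {
    /** Hypothetical record: a payload tagged with its arrival time in milliseconds. */
    static final class Record {
        final long timestampMs;
        final String payload;

        Record(long timestampMs, String payload) {
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    /**
     * Groups records into fixed-width batches keyed by the batch start time.
     * A record with timestamp t lands in the batch starting at floor(t / interval) * interval.
     */
    static SortedMap<Long, List<String>> batch(List<Record> records, long intervalMs) {
        SortedMap<Long, List<String>> batches = new TreeMap<>();
        for (Record r : records) {
            long batchStart = Math.floorDiv(r.timestampMs, intervalMs) * intervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(r.payload);
        }
        return batches;
    }
}
```

With a 1000 ms interval, records arriving at 0 ms and 999 ms share the first batch, while a record at 1000 ms opens the next one.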
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class JavaSqlNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
      System.exit(1);
    }
    StreamingExamples.setStreamingLogLevels();
    // Create the context with a 1 second batch size
    SparkConf sparkConf = new SparkConf().setAppName("JavaSqlNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
    // Create a JavaReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (e.g. generated by 'nc')
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
        args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
    JavaDStream<String> words = lines.flatMap(Split2Words());
    ..
  }
  ..
}
Inline sample implementation
CPU IMPLEMENTATION
1. Sets up the DStream CPU adapter connected to the system NIC.
2. Configures IP/port on the CPU NIC.
3. etlLibCPU.jar (CPU implementation)
• split2Words()
• split2Sort()
• split2Count()
FPGA IMPLEMENTATION
1. Sets up the DStream FPGA adapter connected to the FPGA NIC.
2. Configures IP/port on the FPGA NIC.
3. etlLibCPU.jar (FPGA implementation)
• split2Words()
• split2Sort()
• split2Count()
The FPGA is set up to stream and filter data before passing it to Spark as a DStream object.
* Full implementation:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaSqlNetworkWordCount.java
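The slide names three ETL functions but does not show their bodies. As a point of reference, a CPU-side version might look like the sketch below. The behavior of each function is an assumption inferred from its name (split on whitespace, sort the words, count the words), and the class name `EtlLibCpuSketch` is hypothetical; none of this is taken from the actual etlLibCPU.jar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EtlLibCpuSketch {
    /** Splits a line into words on runs of whitespace (assumed behavior). */
    static List<String> split2Words(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Splits a line into words and returns them in natural (lexicographic) order. */
    static List<String> split2Sort(String line) {
        List<String> words = split2Words(line);
        Collections.sort(words);
        return words;
    }

    /** Splits a line into words and counts the occurrences of each word. */
    static Map<String, Integer> split2Count(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : split2Words(line)) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```

An accelerated library that keeps these signatures could then be checked against this version on the same input lines.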
Off-load Processing:
Low latency off-Load of ML/DL libraries
[Diagram labels: CPU (User Space, Kernel Space): Spark DStream Adapter, Arka Runtime, FPGA Kernel Driver; DMA (VirtIO); FPGA: Shell, ML Libraries, DL Libraries, Inter-FPGA Network. Legend: Infrastructure Components, Megh Components]
• Machine Learning Libraries: optimized libraries for K-Means, SVM, etc.
• Deep Learning Libraries: optimized libraries for DNN-based inference engines
• Inter-FPGA Network: FPGA network for sharing FPGA resources for larger DNN topologies
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

public class JavaKMeansExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("JavaKMeansExample");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    ..
    // Cluster the data into two classes using KMeans
    int numClusters = 2;
    int numIterations = 20;
    KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
    ..
    double cost = clusters.computeCost(parsedData.rdd());
    System.out.println("Cost: " + cost);
    // Evaluate clustering by computing Within Set Sum of Squared Errors
    double WSSSE = clusters.computeCost(parsedData.rdd());
    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
    ..
    jsc.stop();
  }
}
Offload Sample Implementation
mlib.jar
(CPU library implementation)
• KMeans.train()
• KMeansModel.computeCost()
mlibFPGA.jar
(FPGA accelerated library implementation)
• KMeans.train()
• KMeansModel.computeCost()
The CPU and FPGA libraries share the same function signatures, so the application is accelerated transparently by using the FPGA library.
* Full implementation:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaKMeansExample.java
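When two libraries share a signature, a small reference computation is useful for checking that both return the same result. The cost printed in the example is the Within Set Sum of Squared Errors: for each point, the squared Euclidean distance to its nearest cluster center, summed over all points. The sketch below computes that quantity in plain Java on arrays; the class name `KMeansCostSketch` and the array-based signature are assumptions for illustration and are not part of either library.

```java
final class KMeansCostSketch {
    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Sums, over all points, the squared distance to the nearest center. */
    static double computeCost(double[][] points, double[][] centers) {
        double cost = 0.0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                best = Math.min(best, squaredDistance(p, c));
            }
            cost += best;
        }
        return cost;
    }
}
```

For the points (0,0), (1,0), (10,0) and centers (0,0), (10,0), only the middle point is off-center, at squared distance 1, so the cost is 1.0.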
public static void main(String[] args) throws Exception {
  System.out.println(" Java: NumAdd Spark Demo.\n");
  Long total = null;
  SparkConf sparkConf = new SparkConf().setAppName("NumAdd");
  JavaSparkContext ctx = new JavaSparkContext(sparkConf);
  // Read the input file named by args[0] with a minimum of one partition
  JavaRDD<String> lines = ctx.textFile(args[0], 1);
  // Map each line through sumOneString, then reduce the per-line results to a total
  JavaRDD<Long> sums = lines.map(new sumOneString());
  total = sums.reduce((a, b) -> (a + b));
  System.out.println("Total is -> " + total);
  ctx.stop();
}
numAdd Demo:
Implementation details
numAdd is a slight variation of the popular WordCount sample in which the numbers in the files are parsed and added up using Spark.
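For comparison with the accelerated version, a CPU-only per-line sum could be written as below. This is a sketch under stated assumptions: it uses `java.util.function.Function` rather than Spark's function interface so that it runs without Spark, it assumes each line holds whitespace-separated integers, and the names `SumOneStringCpu` and `total` are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

final class SumOneStringCpu implements Function<String, Long> {
    /** Parses the whitespace-separated integers on one line and returns their sum. */
    @Override
    public Long apply(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return 0L;
        }
        long sum = 0L;
        for (String token : trimmed.split("\\s+")) {
            sum += Long.parseLong(token);
        }
        return sum;
    }

    /** Map-then-reduce over lines, mirroring the shape of the NumAdd job. */
    static long total(List<String> lines) {
        return lines.stream().map(new SumOneStringCpu()).reduce(0L, Long::sum);
    }
}
```

The map step produces one partial sum per line and the reduce step adds the partial sums, which is the same two-stage shape as the Spark job.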
Accelerated Operation:
sumOneString
AFU.Factory fpgaFactory = new AFU.Factory();
AFU wc = fpgaFactory.createAFU("meghna");
TransferBuffer inbuf = wc.getTransferBuffer( input1.length() );
wc.queueInputBuffer( inbuf );
// Reuse buffer 1 for the output. AFU design ensures this is safe.
wc.queueOutputBuffer( inbuf ); // Arka permits it.
wc.startFunction(); // The real work starts here
TransferBuffer obuff = wc.waitOnOutputQueue();
return ( obuff.getByteBuffer().asLongBuffer().get(0) );
Instantiate AFU as a Service: enables multiple distinct implementations to co-exist and be selected dynamically, specifically an FPGA implementation and a CPU-based fallback implementation.
Buffer-queue-based model
• (Register interface available but not shown)
AFU-optimized Transfer Buffers allow for:
• Zero copy to HW and efficient access
• Efficient access from Java/Scala
• AFU-specific implementation
• May use direct byte buffers, SVM, Netty, Apache Arrow, etc.
Start operation.
Wait for results in output queue.
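To make the fallback idea concrete, here is a small software-only sketch of a buffer-queue accelerator interface with a CPU fallback chosen at creation time. It is not the Arka API: the `Afu` interface, `CpuSumAfu`, `createAfu`, and `sum` are hypothetical names that only imitate the call sequence shown on the slide, and a plain heap `ByteBuffer` stands in for the transfer buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Supplier;

final class AfuServiceSketch {
    /** Hypothetical accelerator interface following a buffer-queue model. */
    interface Afu {
        void queueInputBuffer(ByteBuffer in);
        void queueOutputBuffer(ByteBuffer out);
        void startFunction();
        ByteBuffer waitOnOutputQueue();
    }

    /** CPU fallback: sums the whitespace-separated integers in the input buffer. */
    static final class CpuSumAfu implements Afu {
        private final Queue<ByteBuffer> inputs = new ArrayDeque<>();
        private final Queue<ByteBuffer> outputs = new ArrayDeque<>();
        private final Queue<ByteBuffer> completed = new ArrayDeque<>();

        @Override public void queueInputBuffer(ByteBuffer in) { inputs.add(in); }
        @Override public void queueOutputBuffer(ByteBuffer out) { outputs.add(out); }

        @Override public void startFunction() {
            ByteBuffer in = inputs.remove();
            ByteBuffer out = outputs.remove();
            // Copy the input out first so the same buffer can be reused for output.
            byte[] bytes = new byte[in.remaining()];
            in.duplicate().get(bytes);
            String text = new String(bytes, StandardCharsets.US_ASCII).trim();
            long sum = 0L;
            if (!text.isEmpty()) {
                for (String token : text.split("\\s+")) {
                    sum += Long.parseLong(token);
                }
            }
            out.clear();          // make the full capacity writable again
            out.putLong(0, sum);  // the result occupies the first 8 bytes
            completed.add(out);
        }

        @Override public ByteBuffer waitOnOutputQueue() { return completed.remove(); }
    }

    /** Returns the hardware-backed AFU if the supplier yields one, else the CPU fallback. */
    static Afu createAfu(Supplier<Afu> hardware) {
        try {
            Afu afu = hardware.get();
            if (afu != null) {
                return afu;
            }
        } catch (RuntimeException e) {
            // Hardware unavailable: fall through to the software implementation.
        }
        return new CpuSumAfu();
    }

    /** Queues one buffer as both input and output, starts the function, reads the result. */
    static long sum(Afu afu, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(Math.max(Long.BYTES, bytes.length));
        buf.put(bytes);
        buf.flip();
        afu.queueInputBuffer(buf);
        afu.queueOutputBuffer(buf);
        afu.startFunction();
        return afu.waitOnOutputQueue().getLong(0);
    }
}
```

The caller in `sum` never learns which implementation it received, which is the property that lets a hardware and a software version co-exist behind one name. The buffer is sized to at least eight bytes so that the result fits when the input text is shorter than a `long`.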
Demo: NumAdd Offload Profiling
[Chart: NumAdd execution time (s), y-axis 0 to 200, vs. file size (1M, 2M, 4M). Series: FPGA Offload, Spark Streaming]
* Executor/task on the worker node restricted to 1 thread
In Summary
• Megh CPU+FPGA platform optimized for Real Time Analytics
• Arka Runtime supports different streaming frameworks
• Sira AFUs deliver low latency and high throughput for inline and offload processing
Thank You
info@meghcomputing.com

Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabhat Gupta
