Build Your Next Apache Spark Job in .NET Using Mobius
Kaarthik Sivashanmugam
@kaarthikss
Mobius
C# API for building Apache Spark applications in .NET
Motivation
• Enable organizations deeply invested in .NET to build Apache Spark applications in C#
• Reuse existing .NET libraries in Spark applications
Yet Another Language Binding?
Popularity of C#
• StackOverflow.com Developer Survey
• RedMonk Programming Language Rankings
.NET ecosystem ~ enabling languages like F#
Spark Survey Results
[Charts, labeled 2015 and 2016: Most Important Aspects of Spark; Fastest Growing Areas from 2014 to 2015]
Mobius & Spark
Scala/Java API
SparkR PySpark
C# API
Apache Spark
Spark Apps in C#
Word Count
C#
Scala
F#
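The word-count code shown on this slide did not survive in the transcript. As a rough, local illustration of the computation only, the sketch below counts words with plain LINQ. It does not use the Mobius API, and the class and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalWordCount
{
    // Hypothetical helper (not part of Mobius): splits each line on spaces
    // and counts how many times each word occurs.
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .GroupBy(word => word)
            .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Main()
    {
        var counts = Count(new[] { "to be or not to be", "be quick" });
        foreach (var pair in counts.OrderBy(p => p.Key, StringComparer.Ordinal))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

A Spark version would express the same split, pair, and sum steps as RDD transformations so that the work is distributed across executors.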
Develop & Launch Mobius Applications
On the Spark client:
A. Get Mobius release
B. Get Mobius driver and dependencies
C. Run sparkclr-submit.cmd or sparkclr-submit.sh ~ runs Spark job
To produce the Mobius driver:
1. Add reference to Mobius package in NuGet
2. Develop, debug, test Mobius driver application
3. Build Mobius driver
Example: sparkclr-submit.cmd
--master spark://IP:PORT
--total-executor-cores 200
--executor-memory 12g
--exe Pi.exe D:\Mobius\examples\Pi
Mobius & Spark
Driver: C# Driver (CLR) <-> IPC Sockets <-> SparkContext (JVM)
Workers: C# Worker (CLR) <-> IPC Sockets <-> SparkExecutor (JVM), one pair per executor (three shown)
Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux
DEMO – RUNNING MOBIUS APP
sparkclr-submit.cmd
> sparkclr-submit.cmd
--exe SparkClrWordCount.exe
C:\spark-clr_2.10-1.6.200\examples\Batch\WordCount
C:\temp\wcdata.txt
sparkclr-submit.sh in Linux
DEMO – DEBUGGING MOBIUS APP
Debug Mode
> sparkclr-submit.cmd debug
Debug Mobius Word Count Example in VS
(set CSharpWorkerPath in config)
DEMO – USING MOBIUS C# SHELL
sparkclr-shell.cmd
> var rdd = sc.Parallelize(Enumerable.Range(0, 100), 2);
> rdd.Reduce((x,y) => x+y) //prints sum of the values;
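As a quick sanity check on what the shell session should report, the same arithmetic can be run locally with LINQ. This is only a local analogue, not Mobius code: Aggregate, Count, and Select stand in for the RDD's Reduce, Count, and Map (the latter two appear in the speaker notes for this deck), and the class and method names are invented for the example.

```csharp
using System;
using System.Linq;

public static class ShellArithmetic
{
    // Sum of 0..99, the local analogue of rdd.Reduce((x,y) => x+y).
    public static int Sum() =>
        Enumerable.Range(0, 100).Aggregate((x, y) => x + y);

    // Number of elements, the local analogue of rdd.Count().
    public static int Count() =>
        Enumerable.Range(0, 100).Count();

    // Sum of 1..100, the local analogue of rdd.Map(n => n+1).Reduce((x,y) => x+y).
    public static int ShiftedSum() =>
        Enumerable.Range(0, 100).Select(n => n + 1).Aggregate((x, y) => x + y);

    public static void Main()
    {
        Console.WriteLine(Sum());        // 4950
        Console.WriteLine(Count());      // 100
        Console.WriteLine(ShiftedSum()); // 5050
    }
}
```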
DEMO – USING F# SHELL
fsi.exe
> sparkclr-submit.cmd debug
> fsi --use:C:\temp\mobius-init.fsx
let dataframe = session.Read().Json(@"C:\temp\data.json");;
dataframe.Show();;
dataframe.ShowSchema();;
dataframe.Count();;
Kafka Streaming Example
Initialize StreamingContext & Checkpoint
Create Kafka DStream
Use DStream transformations to count logs by loglevel within a time window
Save log count
Start stream processing
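The streaming code for these steps is not in the transcript. To make the "count logs by loglevel within a time window" step concrete, here is a local sketch in plain C#. It is not the Mobius streaming API: the LogEntry type, the fixed (non-overlapping) windows, and all names are assumptions made for illustration, and a real DStream window may also slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WindowedLogCount
{
    // Hypothetical log record: a timestamp in seconds and a level such as "INFO".
    public sealed class LogEntry
    {
        public long TimestampSeconds { get; }
        public string Level { get; }

        public LogEntry(long timestampSeconds, string level)
        {
            TimestampSeconds = timestampSeconds;
            Level = level;
        }
    }

    // Assigns each entry to a fixed window (identified by its start time)
    // and counts entries per (window start, level) pair.
    public static Dictionary<Tuple<long, string>, int> CountByLevel(
        IEnumerable<LogEntry> entries, long windowSeconds)
    {
        return entries
            .GroupBy(e => Tuple.Create(e.TimestampSeconds / windowSeconds * windowSeconds, e.Level))
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var entries = new[]
        {
            new LogEntry(1, "INFO"),
            new LogEntry(5, "ERROR"),
            new LogEntry(9, "INFO"),
            new LogEntry(12, "INFO"),
        };

        foreach (var pair in CountByLevel(entries, 10))
        {
            Console.WriteLine($"window {pair.Key.Item1}s, {pair.Key.Item2}: {pair.Value}");
        }
    }
}
```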
Mobius in Linux
• Mono is used to run Mobius with Spark on Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Mobius validated in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub
Project Info
• https://github.com/Microsoft/Mobius
Contributions welcome!
• MIT license
• Discussions
– StackOverflow: tag “SparkCLR”
– Gitter: https://gitter.im/Microsoft/Mobius
– Twitter: @MobiusForSpark
CSharpRDD
• C# operations use CSharpRDD which needs CLR to execute
– If no C# transformation or UDF is used, CLR is not needed ~ execution is entirely JVM-based
• RDD<byte[]>
– Data is stored as serialized objects and sent to C# worker process
• Transformations are pipelined when possible
– Avoids unnecessary serialization & deserialization within a stage
Performance Considerations
• Map & Filter RDD operations in C# require serialization & deserialization of data ~ impacts performance
– C# operations are pipelined when possible ~ minimizes Ser/De
– Persistence is handled by JVM ~ checkpoint/cache on an RDD impacts pipelining for CLR operations
• DataFrame operations without C# UDFs do not require Ser/De
– Perf will be same as native Scala-based Spark application
– Execution plan optimization & code generation perf improvements in Spark leveraged
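The pipelining idea above can be sketched in a few lines: if consecutive C# operations are composed into a single per-partition function, the data only needs to cross the serialization boundary once per stage rather than once per operation. The code below is a simplified illustration under that assumption, not Mobius's implementation, and Compose is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PipeliningSketch
{
    // Composes two per-partition functions into one, so both can be applied
    // in a single pass without materializing the intermediate result.
    public static Func<IEnumerable<TIn>, IEnumerable<TOut>> Compose<TIn, TMid, TOut>(
        Func<IEnumerable<TIn>, IEnumerable<TMid>> first,
        Func<IEnumerable<TMid>, IEnumerable<TOut>> second)
    {
        return input => second(first(input));
    }

    public static void Main()
    {
        // Two operations a user might chain, e.g. a map followed by a filter.
        Func<IEnumerable<int>, IEnumerable<int>> addOne = xs => xs.Select(x => x + 1);
        Func<IEnumerable<int>, IEnumerable<int>> keepEven = xs => xs.Where(x => x % 2 == 0);

        // One composed function instead of two separate round trips.
        var pipelined = Compose(addOne, keepEven);

        Console.WriteLine(string.Join(", ", pipelined(new[] { 1, 2, 3, 4, 5 }))); // 2, 4, 6
    }
}
```

This also suggests why cache or checkpoint calls affect pipelining: persistence is handled on the JVM side, so the chain has to be cut at that point and the intermediate data handed back.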
INTERNALS OF DRIVER & WORKER
Driver-side Interop
1. Launch: sparkclr-submit.cmd or sparkclr-submit.sh starts CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend
• C# side: SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, … ~ create and manage proxies for JVM objects
• Interop components: mirror C#-side operations, invoke JVM methods
• JVM side: SparkConf, SparkContext, RDD, DataFrame, DStream, CSharpRDD, …
Worker-side Interop
1. Compute: CSharpRDD in the Executor (JVM, Spark Worker)
2. Launch: CSharpWorker.exe (CLR)
3. Read bytes
4. Execute C# operation
5. Write bytes
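The read, execute, write cycle on the worker side can be pictured as a simple loop over byte records. The sketch below is an assumption-laden illustration, not the Mobius wire protocol: the length-prefixed framing, the -1 end marker, and the Process method are all invented for the example, and in-memory streams stand in for the IPC sockets.

```csharp
using System;
using System.IO;
using System.Text;

public static class WorkerLoopSketch
{
    // Reads length-prefixed records from input, applies the operation to each
    // payload, and writes length-prefixed results to output. A negative length
    // marks the end of the data. This framing is hypothetical.
    public static void Process(Stream input, Stream output, Func<byte[], byte[]> operation)
    {
        var reader = new BinaryReader(input);
        var writer = new BinaryWriter(output);
        while (true)
        {
            int length = reader.ReadInt32();            // read bytes
            if (length < 0) break;
            byte[] payload = reader.ReadBytes(length);
            byte[] result = operation(payload);         // execute C# operation
            writer.Write(result.Length);                // write bytes
            writer.Write(result);
        }
        writer.Write(-1);
        writer.Flush();
    }

    public static void Main()
    {
        // Build an input stream holding two UTF-8 records and the end marker.
        var input = new MemoryStream();
        var inputWriter = new BinaryWriter(input);
        foreach (var text in new[] { "spark", "mobius" })
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            inputWriter.Write(bytes.Length);
            inputWriter.Write(bytes);
        }
        inputWriter.Write(-1);
        inputWriter.Flush();
        input.Position = 0;

        var output = new MemoryStream();
        Process(input, output, payload =>
            Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(payload).ToUpperInvariant()));

        // Read the results back and print them.
        output.Position = 0;
        var outputReader = new BinaryReader(output);
        for (int length = outputReader.ReadInt32(); length >= 0; length = outputReader.ReadInt32())
        {
            Console.WriteLine(Encoding.UTF8.GetString(outputReader.ReadBytes(length)));
        }
    }
}
```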
THANK YOU.
• Mobius is production-ready & Cloud-ready
• Use Mobius to build Apache Spark jobs in .NET
• Contribute to github.com/Microsoft/Mobius
• @MobiusForSpark

Spark Summit EU talk by Kaarthik Sivashanmugam

Editor's Notes

  • #11, #13, #15, #17 (#14 has only the first two lines):
      using System.Linq;
      var rdd = sc.Parallelize(Enumerable.Range(0, 100), 2);
      rdd.Reduce((x,y) => x+y) //prints sum of the values
      rdd.Count()
      rdd.Map(n => n+1).Reduce((x,y) => x+y)