An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko Korndorf

Heiko Korndorf, Wireframe
An Update on Scaling Data Science with SparkR
#DSSAIS18

Agenda
• About Me
• Spark & R
• Spark Architecture
• Spark DataFrames and SparkSQL
• Natively Distributed ML with Spark ML
• Big Compute: Parallelization with Spark UDFs
• ML & Data-in-motion: Spark Streaming
• Tips & Pitfalls
• What About Python?
• Summary & Outlook

About Me
• MSc in Computer Science, University of Zurich
• > 20 Years in Consulting
• Finance, Energy, Telco, Pharma, Manufacturing
• EAI, BI/Data Warehousing, CRM, ERP, Technology
• Speaker at SparkSummit, HadoopSummit, and others
• Founder & CEO Wireframe AG
• PatternFinder: Data Science Data Warehouse / Business Machine Intelligence
• PatternGenerator: Development Tool for Streaming Applications
• https://wireframe.digital

SparkR Architecture
Diagram layers: Integration, Libraries, Core
• Execute R on cluster
  • Master/Slave
  • Out-Of-Memory Datasets
  • Access Data Lake
• Powerful Libraries
  • Machine Learning
  • SQL
  • Streaming
• R Integration through SparkR

Data (Lake) Access
• Ability to read Big Data File Formats
  • HDFS, AWS S3, Azure WASB, …
  • Parquet, CSV, JSON, ORC, ...
• Security
  • Fine-grained authorization
  • Role-/Attribute-based Access Control
• Governance
  • Metadata Management
  • Lineage

SparkSQL
• Execute SQL against Spark DataFrame
  • SELECT: specify projection
  • WHERE: filter criteria
  • GROUP BY: group/aggregate
  • JOIN: join tables
• Alternatively, use select(), where(), groupBy(), count(), etc.
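
The four clauses above can be tried on a small in-memory table. The sketch below is a stand-in, not Spark code: it uses Python's built-in sqlite3 module instead of SparkSQL, and the table names, column names, and values (customers, orders, country, amount) are made up for illustration. Only the shape of the statement carries over.

```python
import sqlite3

# In-memory database standing in for tables registered with SparkSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

# Hypothetical sample data.
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "CH"), (2, "DE")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 30.0), (2, 15.0)],
)

query = """
SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS total  -- projection
FROM orders o
JOIN customers c ON o.customer_id = c.id                        -- join tables
WHERE o.amount > 10                                             -- filter criteria
GROUP BY c.country                                              -- group/aggregate
ORDER BY c.country
"""

rows = conn.execute(query).fetchall()
for country, n_orders, total in rows:
    print(country, n_orders, total)
```

The order with amount 5.0 is dropped by the WHERE clause before grouping, so it counts toward neither the order count nor the total.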

SparkR & Streaming
Use R to process data streams
• Structured Streaming:
  • DataFrames with streaming sources
  • New data in stream is appended to an unbounded table (i.e. DataFrame)
• Seamless integration:
  • read.stream("kafka", ….)
Diagram: Data Stream → Unbounded Table

SparkR UDFs
SparkR Functions: gapply()/gapplyCollect(), dapply()/dapplyCollect()
• Apply a function to each group/partition of a Spark DataFrame
• Input: Grouping Key and DataFrame partition as R data.frame
• Output: R data.frame
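
The contract described above (grouping key and the group's rows in, a data frame out) can be sketched without a cluster. The following is a plain-Python analogy, not the SparkR API: gapply_local, mean_load, and the sensor/load columns are hypothetical names, and lists of dicts stand in for data frames. In SparkR each group would be processed by an R worker on the cluster; here the groups are processed one after another.

```python
from collections import defaultdict

def gapply_local(rows, key, func):
    """Group rows by `key`, call func(group_key, group_rows) per group,
    and concatenate the rows each call returns."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = []
    for group_key, group_rows in groups.items():
        result.extend(func(group_key, group_rows))
    return result

def mean_load(sensor, group_rows):
    """Per-group function: one summary row per sensor."""
    values = [r["load"] for r in group_rows]
    return [{"sensor": sensor, "n": len(values), "mean_load": sum(values) / len(values)}]

# Hypothetical input data.
rows = [
    {"sensor": "a", "load": 1.0},
    {"sensor": "b", "load": 4.0},
    {"sensor": "a", "load": 3.0},
]

summary = gapply_local(rows, "sensor", mean_load)
print(summary)
```

Because the per-group function only sees its own group, the groups can be computed independently, which is what makes this pattern easy to distribute.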

SparkR UDFs
SparkR spark.lapply()
• Run a function over a list of elements and distribute the computation with Spark
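
The map-over-a-list shape of spark.lapply() can be illustrated with a small hyperparameter sweep. This is a local analogy under stated assumptions, not SparkR: a Python thread pool stands in for Spark executors, and the data, the ridge_slope and evaluate helpers, and the penalty values are invented for the example. Each list element is evaluated independently, which is the property Spark exploits when distributing the work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: y is roughly 2 * x.
XS = [1.0, 2.0, 3.0, 4.0]
YS = [2.1, 3.9, 6.2, 7.8]

def ridge_slope(lam):
    """Closed-form slope of a no-intercept ridge fit for one penalty value."""
    sxy = sum(x * y for x, y in zip(XS, YS))
    sxx = sum(x * x for x in XS)
    return sxy / (sxx + lam)

def evaluate(lam):
    """Return (penalty, training sum of squared errors) for one candidate."""
    slope = ridge_slope(lam)
    sse = sum((y - slope * x) ** 2 for x, y in zip(XS, YS))
    return lam, sse

def sweep(lams, workers=4):
    """Apply `evaluate` to every element; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, lams))

results = sweep([0.0, 0.1, 1.0, 10.0])
best_lam, best_sse = min(results, key=lambda r: r[1])
print(results)
print("best penalty on training data:", best_lam)
```

On training error the unpenalized fit wins by construction; a real sweep would score each candidate on held-out data.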

Big Compute
Areas where massively parallel computation is relevant:
• Ensemble Learning for Time Series
• Hyperparameter Sweeps
• High-Dimensional Problem/Search-Space
• Wireframe PatternFinder
• Shape/Motif Detection
• IoT Pattern/Shapes
• Monte-Carlo Simulation
  • Value-at-Risk (Finance)
  • Catastrophe Modeling (Reinsurance)
  • Inventory Optimization (Oil & Gas, Manufacturing)

Big Compute
Massive Time Series Forecasting
• Sequential computation: > 22 hours
• Single-Server, parallelized: > 4.5 hours
• SparkR, Cluster w/ 25 nodes: ~ 12 minutes
Wireframe PatternFinder
• 15,500,000 Models to be computed
• 50 DataFrames x 5 Dependents x 10 Independents x 5 Models x 100 Segments
• 0.1 Sec. per Model
• Sequential: ~ 18 Days
• 1000 Cores: ~ 26 Minutes
Implications
• Minor refactoring of R code
• Massive cost reduction through elastic scaling of cloud resources
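
The runtimes quoted above follow from simple arithmetic, reproduced here as a check. The sketch assumes perfect linear scaling with no scheduling or data-transfer overhead, and it takes the total of 15.5 million models as given rather than recomputing it from the listed factors. The helper name runtime_seconds is illustrative.

```python
# Figures taken from the slide; everything else is derived.
MODELS = 15_500_000        # models to be computed
SECONDS_PER_MODEL = 0.1    # stated cost per model
CORES = 1000               # stated degree of parallelism

def runtime_seconds(models, seconds_per_model, cores=1):
    """Ideal runtime, assuming perfect scaling and no overhead."""
    return models * seconds_per_model / cores

sequential_days = runtime_seconds(MODELS, SECONDS_PER_MODEL) / 86_400
parallel_minutes = runtime_seconds(MODELS, SECONDS_PER_MODEL, CORES) / 60

# Forecasting example: approximate speedups implied by
# "> 22 hours", "> 4.5 hours", and "~ 12 minutes".
speedup_vs_sequential = 22 * 60 / 12
speedup_vs_single_server = 4.5 * 60 / 12

print(f"sequential: {sequential_days:.1f} days")
print(f"{CORES} cores: {parallel_minutes:.1f} minutes")
print(f"speedup vs sequential: {speedup_vs_sequential:.0f}x")
print(f"speedup vs single server: {speedup_vs_single_server:.1f}x")
```

The results round to about 18 days and 26 minutes, matching the slide, and the forecasting case works out to roughly 110x over sequential execution and 22.5x over the parallelized single server.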

Tips & Pitfalls
• Generate diagrams in SparkR
• PDF, HTML, Rmarkdown
• Store in shared persistence or serialize into Spark DataFrame
• SPARK-21866 might be helpful?
• Store complex objects (models) from SparkR
• saveRDS() saves to local storage
• Store in shared persistence or serialize into Spark DataFrame
• Run R on a YARN cluster without locally installed R
• Mixing Scala & R
• Not supported by Oozie’s SparkAction
• Can be replaced with ShellAction
• Not supported by Apache Livy
• Livy only supports Scala, Java, and Python

And What About Python?
“Do I need to learn Python?”
Let’s compare (Spark)R & Py(Spark):
• Language: both are interpreted languages; R (1993), Python (1991)
• Libraries: CRAN (> 10,000 packages), Numpy, scikit-learn, Pandas
• Package Management
• IDEs/Notebooks: RStudio/PyCharm, Jupyter, Zeppelin, Databricks Analytics Platform, IBM Watson Studio, Cloudera Data Science Workbench, ….
And there’s more:
• Market Momentum
• Deep Learning
• Spark Support
• Spark Integration

Market Momentum
Redmonk (Jan 2018): https://redmonk.com/sogrady/2018/03/07/language-rankings-1-18/
1. JavaScript
2. Java
3. Python
4. PHP
5. C#
…
10. Swift
11. Objective-C
12. R
TIOBE index (May 2018):
4. Python
11. R

Deep Learning
Framework    R     Python   Other APIs
TensorFlow   No    Yes      C++, Java, Go, Swift
Keras        Yes   Yes      No
MXNet        Yes   Yes      C++, Scala, Julia, Perl
PyTorch      No    Yes      No
CNTK         No    Yes      C++
• Python is a first-class citizen in the Deep Learning/Neural Network world
• Using R with these DL frameworks is possible but more complex

SparkR v PySpark
Feature                 SparkR   PySpark
Data Lake Integration   Yes      Yes
Spark SQL               Yes      Yes
Spark ML                Yes      Yes
UDFs                    Yes      Yes
Streaming               Yes      Yes
• Both R and Python can access the same types of Spark APIs

Spark Integration
JVM vs Non-JVM
• Spark Executors run in JVM
• R & Python run in different processes
• Data must be moved between both environments (SerDe)
• This serialization overhead lowers performance

Apache Arrow
In-Memory Columnar Data Format
• Cross-Language
• Optimized for numeric data
• Zero-copy reads (no serialization)
• Support
  • Spark 2.3
  • Python
  • R Bindings not available yet
  • And more: Parquet, HBase, Cassandra, …
• Will also improve R-Python-Integration (rpy2, reticulate)
See RStudio Blog (04/19/2018): https://blog.rstudio.com/2018/04/19/arrow-and-beyond/

Summary & Outlook
• Spark is the best option to scale R
  • See also sparklyr, R Server for Spark
• Common Environment for Dev and Production
  • “Looks like R to Data Science, looks like Spark to Data Engineers”
• Security & Data Governance
  • Row-/Column-Level Access Control
  • Full Data Lineage (Up- and Downstream)
• Shared Memory Format
  • Apache Arrow!
• Mix and Match R, Python, and Scala
  • Towards an Open Data Science Platform

Thank You!
Heiko Korndorf
heiko.korndorf@wireframe.digital
