Heiko Korndorf, Wireframe
An Update on
Scaling Data Science
with SparkR
#DSSAIS18
Agenda
• About Me
• Spark & R
• Spark Architecture
• Spark DataFrames and SparkSQL
• Natively Distributed ML with Spark ML
• Big Compute: Parallelization with Spark UDFs
• ML & Data-in-motion: Spark Streaming
• Tips & Pitfalls
• What About Python?
• Summary & Outlook
About Me
• MSc in Computer Science, University of Zurich
• > 20 Years in Consulting
• Finance, Energy, Telco, Pharma, Manufacturing
• EAI, BI/Data Warehousing, CRM, ERP, Technology
• Speaker at SparkSummit, HadoopSummit, and others
• Founder & CEO Wireframe AG
• PatternFinder: Data Science Data Warehouse / Business Machine Intelligence
• PatternGenerator: Development Tool for Streaming Applications
• https://wireframe.digital
SparkR Architecture
Integration
Libraries
Core
• Execute R on cluster
• Master/Slave
• Out-Of-Memory Datasets
• Access Data Lake
• Powerful Libraries
• Machine Learning
• SQL
• Streaming
• R Integration through SparkR
Data (Lake) Access
• Ability to read Big Data File Formats
• HDFS, AWS S3, Azure WASB, …
• Parquet, CSV, JSON, ORC, ...
• Security
• Fine-grained authorization
• Role-/Attribute-based Access Control
• Governance
• Metadata Management
• Lineage
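As a rough illustration of reading these formats from R, the sketch below maps a file extension to a Spark data source and hands the path to SparkR's read.df(). The helper names source_for() and load_from_lake(), and any paths passed to them, are hypothetical; the CSV options shown are one reasonable choice, not a recommendation from the talk.

```r
# Hypothetical helper: map a file extension to a Spark data source name.
source_for <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
         parquet = "parquet",
         csv     = "csv",
         json    = "json",
         orc     = "orc",
         stop("unsupported format: ", ext))
}

# Hypothetical wrapper: load a file from the data lake as a Spark DataFrame.
# Needs SparkR installed and an active session (SparkR::sparkR.session()).
load_from_lake <- function(path) {
  src <- source_for(path)
  if (src == "csv") {
    # CSV carries no schema, so ask Spark to read the header and infer types.
    SparkR::read.df(path, source = src, header = "true", inferSchema = "true")
  } else {
    SparkR::read.df(path, source = src)
  }
}
```

The same call works for hdfs://, s3a:// or wasb:// URIs as long as the cluster has the matching Hadoop connector configured.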
SparkSQL
• Execute SQL against Spark DataFrame
• SELECT
• Specify Projection
• WHERE
• Filter criteria
• GROUP BY
• Group/Aggregate
• JOIN
• Join tables
• Alternatively, use the DataFrame API
• select(), where(), groupBy(), count(), etc.
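To make the two styles concrete, here is a minimal sketch of the same query written once as SQL and once with DataFrame functions. The tables (orders, customers) and their columns are invented for illustration; both functions assume SparkR is installed and a session is active.

```r
# SQL text for the query; table and column names are made up.
query <- "
  SELECT c.country, COUNT(*) AS n
  FROM orders o
  JOIN customers c ON o.customer_id = c.id
  WHERE o.amount > 100
  GROUP BY c.country"

# Variant 1: register the Spark DataFrames as views and run SQL against them.
top_customers_sql <- function(orders, customers) {
  SparkR::createOrReplaceTempView(orders, "orders")
  SparkR::createOrReplaceTempView(customers, "customers")
  SparkR::sql(query)
}

# Variant 2: the same steps with where(), join(), groupBy() and count().
top_customers_api <- function(orders, customers) {
  big    <- SparkR::where(orders, orders$amount > 100)
  joined <- SparkR::join(big, customers,
                         big$customer_id == customers$id, "inner")
  SparkR::count(SparkR::groupBy(joined, "country"))
}
```

Both variants return a Spark DataFrame; the API version names its count column "count" rather than "n".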
Spark MLlib
SparkR & Streaming
Use R to process data streams
• Structured Streaming:
• DataFrames with streaming sources
• New data in stream is appended to an
unbounded table (i.e. DataFrame)
• Seamless integration:
• read.stream("kafka", ...)
Figure: data stream feeding an unbounded table
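A minimal sketch of such a streaming job might look as follows. The helper names, broker address and topic are placeholders, the console sink is chosen only for demonstration, and the Kafka source is assumed to be on the cluster's classpath (it ships as a separate Spark package).

```r
# Build the reader options in plain R so they can be checked without Spark.
# Broker address and topic name are placeholders supplied by the caller.
kafka_options <- function(brokers, topic) {
  list(source = "kafka",
       kafka.bootstrap.servers = brokers,
       subscribe = topic)
}

# Hypothetical job: count identical messages arriving on a topic.
# Needs SparkR, an active session and the Kafka source package.
count_events <- function(brokers, topic) {
  stream <- do.call(SparkR::read.stream, kafka_options(brokers, topic))
  # Kafka delivers the payload as bytes; cast it to a string column.
  events <- SparkR::selectExpr(stream, "CAST(value AS STRING) AS event")
  counts <- SparkR::count(SparkR::groupBy(events, "event"))
  # "complete" mode rewrites the full aggregate on every trigger.
  SparkR::write.stream(counts, "console", outputMode = "complete")
}
```

count_events() returns the streaming query handle, which can be passed to SparkR::awaitTermination() to keep the job running.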
SparkR UDFs
SparkR Functions: gapply()/gapplyCollect(), dapply()/dapplyCollect()
• Apply a function to each group/partition of a Spark DataFrame
• Input: Grouping Key and DataFrame partition as R data.frame
• Output: R data.frame
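The sketch below shows one way this could be used: a plain R function fits a linear trend per group, and gapply() runs it for every group of a Spark DataFrame. The column names (segment, t, value) and both function names are assumptions made for the example.

```r
# Per-group function with the signature gapply() expects:
# `key` is a list of grouping values, `pdf` is a local R data.frame.
fit_trend <- function(key, pdf) {
  model <- lm(value ~ t, data = pdf)
  data.frame(segment = as.character(key[[1]]),
             slope   = unname(coef(model)[["t"]]),
             stringsAsFactors = FALSE)
}

# Hypothetical driver: apply fit_trend() to each segment on the cluster.
# Needs SparkR and an active session; `sdf` is a Spark DataFrame with
# columns segment, t and value.
trend_per_segment <- function(sdf) {
  schema <- SparkR::structType(
    SparkR::structField("segment", "string"),
    SparkR::structField("slope", "double"))
  SparkR::gapply(sdf, "segment", fit_trend, schema)
}
```

Because fit_trend() is ordinary R, it can be tried on a local data.frame before it is shipped to the cluster. dapply() works the same way, except that its function receives only the partition's data.frame and no key.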
SparkR UDFs
SparkR spark.lapply()
• Run a function over a list of elements and
• Distribute the computation with Spark
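As an illustration, the following sketch runs a small hyperparameter sweep: the same plain R function is first applied locally with lapply() and can then be handed unchanged to spark.lapply(). The model (polynomial regression on the built-in mtcars data) and the function names are chosen only for the example.

```r
# One task: fit a polynomial of the given degree and report in-sample RMSE.
fit_degree <- function(degree) {
  model <- lm(mpg ~ poly(hp, degree), data = datasets::mtcars)
  list(degree = degree, rmse = sqrt(mean(residuals(model)^2)))
}

degrees <- 1:4

# Local baseline: plain lapply() on the driver machine.
local_results <- lapply(degrees, fit_degree)

# Distributed variant: same function, one Spark task per list element.
# Needs SparkR and an active session.
sweep_on_spark <- function(values) {
  SparkR::spark.lapply(values, fit_degree)
}
```

Both variants return an ordinary R list, so downstream code does not need to know where the work was done.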
Big Compute
Areas where massively parallel computation is relevant:
• Ensemble Learning for Time Series
• Hyperparameter Sweeps
• High-Dimensional Problem/Search-Space
• Wireframe PatternFinder
• Shape/Motif Detection
• IoT Pattern/Shapes
• Monte-Carlo Simulation
• Value-at-Risk (Finance)
• Catastrophe Modeling (Reinsurance)
• Inventory Optimization (Oil & Gas, Manufacturing)
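To show why Monte-Carlo workloads parallelize so easily, here is a toy Value-at-Risk sketch. Each seed produces an independent chunk of simulated returns, so the chunks could be computed by separate Spark tasks. The normal return model, its parameters and all function names are assumptions for illustration only.

```r
# One independent simulation chunk: n daily returns from a normal model.
simulate_chunk <- function(seed, n = 10000, mu = 0, sigma = 0.02) {
  set.seed(seed)
  rnorm(n, mean = mu, sd = sigma)
}

# Value-at-Risk as the loss that is exceeded with probability 1 - level.
value_at_risk <- function(returns, level = 0.99) {
  -unname(quantile(returns, probs = 1 - level))
}

seeds <- 1:8

# Local run; on a cluster the lapply() call could be replaced by
# SparkR::spark.lapply(seeds, simulate_chunk).
returns <- unlist(lapply(seeds, simulate_chunk))
var_99  <- value_at_risk(returns)
```

Only the seeds travel to the workers and only the simulated values travel back, which keeps the data movement small relative to the compute.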
Big Compute
Massive Time Series Forecasting
• Sequential computation: > 22 hours
• Single-Server, parallelized: > 4.5 hours
• SparkR, Cluster w/ 25 nodes: ~ 12 minutes
Wireframe PatternFinder
• 15,500,000 Models to be computed
• 50 DataFrames x 5 Dependents x 10 Independents x 5 Models x 100 Segments
• 0.1 Sec. per Model
• Sequential: ~ 18 Days
• 1000 Cores: ~ 26 Minutes
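The PatternFinder timings can be reproduced with a few lines of arithmetic. The sketch assumes perfectly linear scaling across cores, with no scheduling or serialization overhead, which is an idealization.

```r
models            <- 15500000   # models to compute (from the slide)
seconds_per_model <- 0.1        # runtime of a single model fit
cores             <- 1000       # cores available in the cluster

total_seconds <- models * seconds_per_model

# Sequential runtime in days (86400 seconds per day).
sequential_days <- total_seconds / 86400

# Ideal parallel runtime in minutes, assuming linear scaling.
parallel_minutes <- total_seconds / cores / 60
```

Rounded, this gives roughly 18 days sequentially and roughly 26 minutes on 1000 cores, in line with the figures quoted above.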
Implications
• Minor refactoring of R code
• Massive cost reduction through elastic scaling of cloud resources
Tips & Pitfalls
• Generate diagrams in SparkR
• PDF, HTML, Rmarkdown
• Store in shared persistence or serialize into Spark DataFrame
• SPARK-21866 might be helpful?
• Store complex objects (models) from SparkR
• saveRDS() saves to local storage
• Store in shared persistence or serialize into Spark DataFrame
• Run R on a YARN cluster w/o locally installed R
• / / -- - / .- . ../ . / -
• Mixing Scala & R
• Not supported by Oozie’s SparkAction
• Can be replaced with ShellAction
• Not supported by Apache Livy
• Only support for Scala, Java, and Python
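For the tip on storing models, one possible approach is to turn the fitted object into a raw byte vector with base R's serialize(), which can then be written to shared storage or carried in a binary column. The helper names are hypothetical, and how the bytes are persisted is left open here.

```r
# Turn any R object (e.g. a fitted model) into a raw byte vector.
model_to_raw <- function(model) {
  serialize(model, connection = NULL)
}

# Restore the object from its bytes, e.g. on another node or in a later job.
raw_to_model <- function(bytes) {
  unserialize(bytes)
}

# Round trip with a small example model on the built-in mtcars data.
model    <- lm(mpg ~ wt, data = datasets::mtcars)
bytes    <- model_to_raw(model)
restored <- raw_to_model(bytes)
```

Unlike saveRDS() with a local path, the byte vector is not tied to the file system of the worker that produced it.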
And What About Python?
“Do I need to learn Python?”
Let’s compare (Spark)R & Py(Spark):
• Language: Interpreted Languages, R (1993), Python (1991)
• Libraries: CRAN (> 10,000 packages), NumPy, scikit-learn, pandas
• Package Management
• IDEs/Notebooks: RStudio/PyCharm, Jupyter, Zeppelin, Databricks Analytics Platform, IBM Watson Studio, Cloudera Data Science Workbench, …
And there’s more:
• Market Momentum
• Deep Learning
• Spark Support
• Spark Integration
Market Momentum
Redmonk (Jan 2018): https://redmonk.com/sogrady/2018/03/07/language-rankings-1-18/
1. JavaScript
2. Java
3. Python
4. PHP
5. C#
...
10. Swift
11. Objective-C
12. R
TIOBE index (May 2018):
4. Python
11. R
Deep Learning
Framework    R     Python   Other APIs
TensorFlow   No    Yes      C++, Java, Go, Swift
Keras        Yes   Yes      No
MXNet        Yes   Yes      C++, Scala, Julia, Perl
PyTorch      No    Yes      No
CNTK         No    Yes      C++
• Python is a first-class citizen in the Deep Learning/Neural Network world
• Using R with these DL frameworks is possible but more complex
SparkR v PySpark
Feature                 SparkR   PySpark
Data Lake Integration   Yes      Yes
Spark SQL               Yes      Yes
Spark ML                Yes      Yes
UDFs                    Yes      Yes
Streaming               Yes      Yes
• Both R and Python can access the same types of Spark APIs
Spark Integration
JVM vs Non-JVM
• Spark Executors run in JVM
• R & Python run in different processes
• Data must be moved between both
environments (SerDe)
• Low performance
Apache Arrow
In-Memory Columnar Data Format
• Cross-Language
• Optimized for numeric data
• Zero-copy reads (no serialization)
• Support
• Spark 2.3
• Python
• R Bindings not available yet
• And more: Parquet, HBase, Cassandra, …
• Will also improve R-Python-Integration (rpy2, reticulate)
See RStudio Blog (04/19/2018): https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
Summary & Outlook
• Spark is the best option to scale R
• See also sparklyr, R Server for Spark
• Common Environment for Dev and Production
• “Looks like R to Data Science,
looks like Spark to Data Engineers”
• Security & Data Governance
• Row-/Column-Level Access Control
• Full Data Lineage (Up- and Downstream)
• Shared Memory Format
• Apache Arrow!
• Mix and Match R, Python, and Scala
• Towards an Open Data Science Platform
Thank You!
Heiko Korndorf
heiko.korndorf@wireframe.digital


An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko Korndorf
