Apache Spark Usage in the
Open Source Ecosystem
Hossein Falaki
@mhfalaki
About me
• Software Engineer / part-time Data Scientist at Databricks
• I have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and R notebooks at Databricks
Stack Overflow 2016 trending tech
Apache Spark Philosophy
1. Unified engine
   Support end-to-end applications
2. High-level APIs
   Easy to use, rich optimizations
3. Integrate broadly
   Storage systems, libraries, etc.
SQL, Streaming, ML, Graph, …
Databricks Community Edition
• In February, Databricks launched a free version of its cloud-based
platform in beta
• Since then, more than 8,000 users have registered
• Users created over 61,000 notebooks in different languages
• This is an analysis of third-party libraries that our beta users
imported to complement Apache Spark in Scala, Python, and R
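The deck does not say how imports were extracted from notebooks. As a minimal sketch of one possible approach for Python cells, the standard-library ast module can collect top-level imported module names. The function name imported_modules and the sample cell are hypothetical.

```python
import ast
from typing import Set


def imported_modules(source: str) -> Set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" counts as module "a"
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import x"
            if node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules


# Hypothetical notebook cell, for illustration only
cell = "import re\nimport numpy as np\nfrom sklearn.linear_model import LinearRegression\n"
print(sorted(imported_modules(cell)))
```

This only handles source that parses as plain Python; cells containing notebook magic commands or syntax errors would need to be filtered out first.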
What % of users use other libraries?
Language   % users importing external libs   Average # libs   Median # libs
Python     75%                               9                2
Scala      55%                               3                1
R          57%                               6                1
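An average well above the median (9 versus 2 for Python) is what a right-skewed distribution looks like: most users import a couple of libraries and a few import many. The sketch below uses made-up counts, not data from the talk, chosen only to show how a single heavy user separates the two statistics.

```python
from statistics import mean, median

# Hypothetical per-user library counts (NOT data from the talk):
# seven light users and one user who imports many libraries.
libs_per_user = [1, 1, 2, 2, 2, 3, 4, 57]

average_libs = mean(libs_per_user)    # pulled upward by the single heavy user
median_libs = median(libs_per_user)   # reflects the typical user

print("average:", average_libs)
print("median:", median_libs)
```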
Installing libraries is easy
Python Packages
Most popular Python packages
What is test_helper?
What are these?
ETL
• re
• datetime
• pandas
• json
• csv
• string
• math / operator
• urllib / urllib2
Visualization
• matplotlib
• ggplot
• seaborn
Advanced analytics
• numpy
• sklearn
• graphframes
• tensorflow
• scipy
Other
• test_helper
• os
• md5
Python package categories
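The category breakdown can be reproduced from a list of imports with a simple lookup table. The sketch below encodes the Python categories exactly as listed on the slide; the helper name category_counts and the "Uncategorized" fallback are assumptions added for illustration.

```python
from collections import Counter

# Categories as listed on the "What are these?" slide for Python
PYTHON_CATEGORIES = {
    "ETL": ["re", "datetime", "pandas", "json", "csv", "string",
            "math", "operator", "urllib", "urllib2"],
    "Visualization": ["matplotlib", "ggplot", "seaborn"],
    "Advanced analytics": ["numpy", "sklearn", "graphframes", "tensorflow", "scipy"],
    "Other": ["test_helper", "os", "md5"],
}

# Invert the table: package name -> category
PACKAGE_TO_CATEGORY = {
    package: category
    for category, packages in PYTHON_CATEGORIES.items()
    for package in packages
}


def category_counts(imports):
    """Count imports per category; unknown packages fall under 'Uncategorized'."""
    return Counter(PACKAGE_TO_CATEGORY.get(pkg, "Uncategorized") for pkg in imports)


# Hypothetical list of imports from one notebook
print(category_counts(["re", "pandas", "numpy", "matplotlib", "requests"]))
```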
What packages go together?
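The chart for this slide is not part of the transcript, and the deck does not state how co-usage was measured. One common way to answer the question is to count how often each pair of packages appears in the same notebook. The sketch below assumes that approach; the function name cooccurrence and the sample notebooks are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(notebook_imports):
    """Count, for every pair of packages, how many notebooks import both."""
    pairs = Counter()
    for imports in notebook_imports:
        # De-duplicate and sort so each pair has one canonical (a, b) order
        for a, b in combinations(sorted(set(imports)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical notebooks, each represented by the packages it imports
notebooks = [
    ["numpy", "pandas", "matplotlib"],
    ["numpy", "pandas"],
    ["re", "json"],
]
for pair, count in cooccurrence(notebooks).most_common(3):
    print(pair, count)
```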
Scala Packages
Most popular Scala libraries
What are these?
ETL
• java/scala util
• scala.collection
• scala.math
• java.{io, nio}
• java.text
• o.a.commons
• kafka
• twitter4j
Visualization
• ?
Advanced analytics
• spark.ml
• graphframes
Other
• java.net
• scala.sys
Scala package categories
What libraries go together?
R Packages
Most popular R packages
What are these?
ETL
• dplyr
• plyr
• reshape2
• jsonlite
• tidyr
• lubridate
• httr
• data.table
Visualization
• ggplot2
• beanplot
• plotly
• ...
Advanced analytics
• sparkr
• h2o
• caret
• e1071
Other
• devtools
• magrittr
R package categories
Comparing Python, Scala & R
Languages have unique features
[Figure labels: Scala / Python / R; R / Python; Scala / Python / R]
• 25% of users use multiple languages
• 3% of notebooks mix different languages
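Databricks notebooks let a cell override the notebook's default language with a magic command such as %python, %scala, or %r on its first line. The deck does not describe how mixed-language notebooks were identified; the sketch below is one hypothetical way to flag them. The names languages_used and is_mixed are assumptions, and only the three languages discussed in the talk are tracked.

```python
# Magic commands for the three languages covered in the talk.
# Cells starting with any other magic (for example %sql) are treated
# as the notebook's default language in this simplified sketch.
LANGUAGE_MAGICS = {"%python": "python", "%scala": "scala", "%r": "r"}


def languages_used(cells, default_language):
    """Return the set of languages used across a notebook's cells."""
    languages = set()
    for cell in cells:
        stripped = cell.lstrip()
        first_token = stripped.split(None, 1)[0] if stripped else ""
        languages.add(LANGUAGE_MAGICS.get(first_token, default_language))
    return languages


def is_mixed(cells, default_language):
    """A notebook is 'mixed' if its cells use more than one language."""
    return len(languages_used(cells, default_language)) > 1


# Hypothetical Python notebook with one R cell
cells = ["import re", "%r\nlibrary(dplyr)"]
print(languages_used(cells, "python"), is_mixed(cells, "python"))
```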
Summary
• Spark users extensively mix it with other packages in different languages
– One of the goals of the Spark project is to work well with other projects
• ETL-related libraries are the most popular category
– Opportunities for new data sources
• Notebooks are being used for “small data” as well as “big data.”
• Languages and their ecosystems have diverse capabilities. Users seem to
be mixing languages to their advantage
– Scala is missing visualization libraries
Try your favorite library in Databricks
http://databricks.com/ce
Try the latest version of Apache Spark and a preview of Spark 2.0
Thank you!
What packages are used together?
