Spark’s Role in the Big Data Ecosystem 
Matei Zaharia
An Exciting Year for Spark 
Very fast community growth 
1.0 release in May 
7+ distributors, 20+ apps
Project Activity
                          June 2013    June 2014
Total contributors               68          255
Companies contributing           17           50
Total lines of code          63,000      175,000
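The growth implied by these figures can be worked out directly. The short Python sketch below uses only the numbers from the slide; the dictionary layout and the `growth_factor` helper name are illustrative choices, not something from the talk.

```python
# Figures copied from the "Project Activity" slide: (June 2013, June 2014).
activity = {
    "total contributors": (68, 255),
    "companies contributing": (17, 50),
    "total lines of code": (63_000, 175_000),
}


def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


growth = {name: growth_factor(*pair) for name, pair in activity.items()}

for name, factor in growth.items():
    print(f"{name}: {factor:.2f}x")
```

By this arithmetic, contributors grew 3.75x over the year, while contributing companies and lines of code each grew a little under 3x.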
Compared to Other Projects
[Figure: two bar charts comparing MapReduce, YARN, HDFS, Storm, and Spark, one by commits and one by lines of code changed. Activity in past 6 months.]
Spark is now the most active project in the Hadoop ecosystem
Compared to Other Projects 
Spark is one of the top 3 most active projects at Apache
More active than “general” data processing projects 
like NumPy, matplotlib, SciKit-Learn
Continuing Growth 
[Figure: contributors per month to Spark. Source: ohloh.net]
Major new additions
Last Summit 
Last Summit we said we’d focus on two things: 
• Standard libraries 
• Enterprise features 
New libraries: Spark SQL, MLlib (machine learning), 
GraphX (graph processing) 
Enterprise features: security, monitoring, HA
Spark SQL
Enables loading & querying structured data in Spark
From Hive:
    from pyspark.sql import HiveContext  # import the snippet relies on
    c = HiveContext(sc)  # sc is an existing SparkContext
    rows = c.sql("select text, year from hivetable")
    rows.filter(lambda r: r.year > 2013).collect()
From JSON:
    tweets.json:
    {"text": "hi", "user": {"name": "matei", "id": 123}}
    c.jsonFile("tweets.json").registerAsTable("tweets")
    c.sql("select text, user.name from tweets")
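To make the nested-field query concrete, here is a minimal plain-Python sketch of what `select text, user.name from tweets` picks out of the sample record. It does not use Spark at all; the `select_fields` helper is hypothetical and exists only for this illustration.

```python
import json

# The sample record shown on the slide as the content of tweets.json.
sample_line = '{"text": "hi", "user": {"name": "matei", "id": 123}}'


def select_fields(record, paths):
    """Pick dotted paths such as "user.name" out of a nested dict."""
    row = {}
    for path in paths:
        value = record
        for key in path.split("."):
            value = value[key]  # descend one level per path component
        row[path] = value
    return row


tweet = json.loads(sample_line)
row = select_fields(tweet, ["text", "user.name"])
print(row)
```

The point of the slide is that Spark SQL resolves such paths from the JSON structure without a schema being declared by hand; the sketch only mimics the lookup itself.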
Spark SQL 
Integrates closely with Spark’s language APIs 
    c.registerFunction("hasSpark", lambda text: "Spark" in text)
    c.sql("select * from tweets where hasSpark(text)")
Uniform interface for data access
    Data sources: Hive, Parquet, JSON, Cassandra, …
    Languages: SQL, Python, Scala, Java
44 contributors in past year
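The idea of registering a Python function and then calling it from SQL can be tried without a cluster. The sketch below uses SQLite from the Python standard library as a stand-in; it is not Spark SQL, and the three sample tweets are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tweets (text)")
conn.executemany(
    "insert into tweets values (?)",
    [("Enjoy Spark Summit",), ("hi",), ("Spark SQL is new",)],  # invented rows
)

# Register a Python function under a SQL-visible name; 1 is its argument count.
# int(...) converts the boolean result to 0/1 for SQLite.
conn.create_function("hasSpark", 1, lambda text: int("Spark" in text))

matches = [
    row[0]
    for row in conn.execute(
        "select * from tweets where hasSpark(text) order by text"
    )
]
print(matches)
```

As on the slide, the predicate lives in ordinary Python while the query stays in SQL; only the engine differs.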
Machine Learning Library (MLlib) 
Standard library of machine learning algorithms 
Now includes 15+ algorithms 
• New in 1.0: decision trees, SVD, PCA, L-BFGS 
• In development: non-negative matrix factorization, LDA, 
Lanczos, multiclass trees, ADMM 
    points = context.sql("select latitude, longitude from tweets")
    model = KMeans.train(points, 10)
40 contributors in past year
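For readers unfamiliar with what a k-means call computes, here is a minimal single-machine sketch of the basic iteration (Lloyd's algorithm) in plain Python. It is not MLlib's implementation: the initialization (first k points), the fixed iteration count, and the sample points are all simplifying assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; returns a list of k cluster centers."""
    # Assumption: seed the centers with the first k points (simple, deterministic).
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Invented sample data: two well-separated groups of points.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers = kmeans(points, 2)
print(centers)
```

In the slide's snippet the points come from a SQL query over tweets and the work is distributed across a cluster; the assignment and update steps shown here are the part that stays conceptually the same.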
Java 8 API 
Enables concise programming in Java similar to 
Scala and Python 
    // sc is an existing JavaSparkContext
    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce((a, b) -> a + b);
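For comparison with the Python style the slide alludes to, the same map-then-reduce shape looks like this in plain Python. This sketch runs on an in-memory list rather than a Spark RDD, and the sample lines stand in for the contents of data.txt.

```python
from functools import reduce

# Stand-in for the lines of data.txt (invented sample content).
lines = ["spark", "sql", "mllib"]

# Map each line to its length, then sum the lengths pairwise.
line_lengths = map(lambda s: len(s), lines)
total_length = reduce(lambda a, b: a + b, line_lengths)
print(total_length)
```

With Java 8 lambdas, the Java version above is nearly line-for-line the same as this.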
What is our vision for Spark?
1. Unified Platform for Big Data Apps 
Uniform API for diverse workloads over diverse storage systems and runtimes
Workloads: Batch, Interactive, Streaming, …
Storage systems and runtimes: Hadoop, Cassandra, Mesos, cloud providers, …
Why a Platform Matters 
Good for developers: one system to learn 
Good for users: take apps anywhere 
Good for distributors: more applications
2. Standard Library for Big Data 
Big data apps lack libraries of common algorithms
Spark’s generality + support for multiple languages make it suitable to offer this
Languages: Python, Scala, Java, R
Libraries on top of Core: SQL, ML, graph, …
Much of future activity will be in these libraries
Databricks & Spark 
At Databricks, we are working to keep Spark 100% 
open source and compatible across vendors 
All our work on Spark is at Apache 
Check out project-specific talks to see what’s next!
Thank You and Enjoy Spark Summit!

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
