APACHE SPARK
SURVEY 2016
REPORT
Table of Contents
Introduction	3
Foreword: Matei Zaharia	 4
REPORT HIGHLIGHTS	 5
APACHE SPARK’S GROWTH CONTINUES	 13
The Apache Spark Community is Growing 	 14
Spark’s Fastest Growing Areas from 2015 to 2016 	 17
Spark Users are Growing 	 18
Spark Users Employ Multiple Languages 	 19
Spark Components Used in Production	 20
Spark is Used Widely in Organizations	 21
Users Solve Complex Problems	 22
Users Employ Multiple Components	 23
What Users Consider Important	 24
Top Three Storage Technologies 	 25
Section Summary	 26
APACHE SPARK IN THE CLOUD IS GROWING	 27
Trend: Increase in Public Cloud Deployments	 28
Trend: Percentage Decrease in On-Premises Deployments 	 29
Section Summary	 30
APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE	 31
Apache Spark Streaming is Growing	 32
Apache Spark Streaming Engine is the Preferred Choice	 34
Section Summary	 35
Afterword: Reynold Xin	 36
About Databricks	 37
2
SPARK SURVEY 2016
Introduction
In July 2016, Databricks conducted an Apache® Spark™ Survey to
identify insights into how organizations are using Spark as well
as highlight growth trends since the last Spark Survey 2015. In
this report, the results reflect answers from over 900 distinct
organizations and 1615 respondents, who were predominantly
Apache Spark users.
2015 was a year of tremendous growth for Apache Spark, and in 2016
that growth remains unabated, not only in areas like
the public cloud, but also in the increased use of Spark Streaming
and Machine Learning. 2016 also shows Spark’s robust
adoption across a variety of organizations, by users in many
functional roles who build complex solutions using multiple Spark
components. Of the roles represented in the survey, 41% of respondents identified
themselves as data engineers, 23% as data scientists, and 21%
as architects; the remaining 10% came from technical management
and 5% from academia.
	
1615 RESPONDENTS
900 DISTINCT ORGANIZATIONS
DATA ENGINEERS: 41%
DATA SCIENTISTS: 23%
ARCHITECTS: 21%
TECHNICAL MANAGEMENT: 10%
ACADEMICS: 5%
3
Foreword: Matei Zaharia
I’m delighted to share the results of this year’s Databricks Apache
Spark Survey. As I noted in the Spark Survey 2015, we
witnessed rapid adoption of Spark and steep growth
of the Spark community. This year, that growth trajectory
continues. In particular, I’m excited to see more Spark
deployments in the cloud and more interest in building real-time
applications using Spark Streaming with multiple components,
such as Machine Learning. Because Apache Spark 2.0 lays the
foundation for Structured Streaming, providing simplified
and unified APIs for writing end-to-end streaming applications, called
continuous applications, I anticipate this interest will surge further in
the coming months with subsequent releases of Spark.
Since its inception, Spark’s core mission has been to make Big
Data simple and accessible for everyone—for organizations of all
sizes and across all industries. And we have not deviated from that
mission. In Apache Spark 2.0, we strived to make Spark easier, faster
and smarter. And we remain committed to our vision of simplicity.
Seventy-six percent of respondents in this survey indicate ease of
programming as one of the most important features of Spark.
Since its inception, Spark’s core mission has been to
make Big Data simple and accessible for everyone—
for organizations of all sizes and across all industries.
And we have not deviated from that mission...
MATEI ZAHARIA
Chief Technologist at Databricks,
VP of Apache Spark at the Apache Software Foundation
@matei_zaharia
			
Spark’s growth continues across industries, with people in many functional
roles building complex data solutions. It has moved well beyond
the early-adopter phase at tech companies and is now mainstream in
large data-driven enterprises.
4
REPORT HIGHLIGHTS
TOP THREE APACHE SPARK TAKEAWAYS
SPARK’S GROWTH CONTINUES
SPARK IN THE CLOUD IS GROWING
SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE
5
This year the growth trend continues in the
community. Increased growth of Apache Spark
Meetup members, a jump in Spark Summit
attendees, more code contributors, and a surge
in companies represented at the Spark Summit
(from several vertical industries) suggest a
growing and thriving Spark community.
NOTABLE SPARK USERS WHO PRESENTED AT SPARK SUMMIT 2016 (logos not reproduced)
CODE CONTRIBUTORS: 600 (2015), 1000 (2016), up 67%
SPARK MEETUP MEMBERS: 66,000 (2015), 225,000 (2016), up 240%
NUMBER OF COMPANIES AT SUMMITS: 1144 (2015), 1800 (2016), up 57%
SPARK SUMMIT ATTENDEES: 3912 (2015), 5100 (2016), up 30%
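The growth percentages quoted above follow from the raw 2015 and 2016 counts. A minimal sketch of that arithmetic in Python, using only the counts stated in this report; the helper name `relative_growth` is ours, introduced purely for illustration:

```python
def relative_growth(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (2015 count, 2016 count) as reported in this survey
community = {
    "code contributors": (600, 1000),
    "Spark Meetup members": (66_000, 225_000),
    "companies at summits": (1144, 1800),
    "Spark Summit attendees": (3912, 5100),
}

# Relative growth per metric, in percent
growth = {name: relative_growth(a, b) for name, (a, b) in community.items()}

for name, pct in growth.items():
    print(f"{name}: {pct:.1f}% growth")
```

The computed values agree with the report's headline figures to within one percentage point; small differences come down to how the report rounded.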
6
Asked what Apache Spark components developers use to build complex solutions
for their use cases, 74% of respondents said they use two or more components
to build different types of products.
NUMBER OF COMPONENTS USED
74% of respondents USE TWO OR MORE COMPONENTS
64% of respondents USE THREE OR MORE COMPONENTS
TYPES OF PRODUCTS BUILT
68% BUSINESS / CUSTOMER INTELLIGENCE
52% DATA WAREHOUSING
45% REAL-TIME / STREAMING SOLUTIONS
40% RECOMMENDATION ENGINES
37% LOG PROCESSING
36% USER-FACING SERVICES
29% FRAUD DETECTION / SECURITY
% of respondents who use Spark to create each product (more than one product could be selected)
7
LANGUAGES USED IN SPARK YEAR-OVER-YEAR
SCALA: 71% (2015), 65% (2016)
PYTHON: 58% (2015), 62% (2016)
SQL: 36% (2015), 44% (2016)
JAVA: 31% (2015), 29% (2016)
R: 18% (2015), 20% (2016)
% of respondents who use each language (more than one language could be selected)
SPARK COMPONENTS USED IN PRODUCTION YEAR-OVER-YEAR
SQL: 24% (2015), 40% (2016)
DATAFRAMES: 15% (2015), 38% (2016)
STREAMING: 14% (2015), 22% (2016)
ADVANCED ANALYTICS (MLlib): 13% (2015), 18% (2016)
% of respondents who use each component in production (more than one component could be selected)
In addition to using multiple Apache Spark components, many respondents indicated that they
use multiple programming languages in Spark. They are also using multiple components in
production, including increased use of Spark Streaming and MLlib.
8
APACHE SPARK’S FASTEST GROWING AREAS IN 2016
STREAMING USERS*: 14% of respondents (2015), 22% of respondents (2016), up 57%
ADVANCED ANALYTICS USERS (MLlib)*: 13% of respondents (2015), 18% of respondents (2016), up 38%
DATAFRAME USERS*: 15% of respondents (2015), 38% of respondents (2016), up 153%
SPARK SQL USERS*: 24% of respondents (2015), 40% of respondents (2016), up 67%
*component used in production
9
APACHE SPARK
DEPLOYMENT
IN PUBLIC CLOUDS
INCREASED BY 10 PERCENTAGE POINTS
SINCE 2015.
51% of users in the 2015 Spark Survey said they
deployed Apache Spark in the public cloud,
compared with 61% of users in 2016, a relative
growth of about 20%.
2015: 51% of respondents deployed in a public cloud
2016: 61% of respondents deploy in a public cloud
While Apache Spark deployments in the public
cloud increased in 2016, the share of on-premises Spark
deployments decreased. For
example, 48% of users in the 2015 Spark Survey and
42% in the 2016 survey said they used the Standalone
cluster manager for their on-premises Spark
deployments, a relative decrease of 13%.
Similarly, YARN and Mesos deployments show relative
decreases of 10% and 36%, respectively.
ON-PREMISES DEPLOYMENTS YEAR-OVER-YEAR
STANDALONE: 48% (2015), 42% (2016)
YARN: 40% (2015), 36% (2016)
MESOS: 11% (2015), 7% (2016)
% of respondents who use each (more than one deployment could be selected)
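The deployment figures above mix two measures: the change in the share of respondents (61% minus 51% is 10 percentage points) and the relative change in that share (10 / 51, about 20%). A minimal sketch of both calculations, using only the shares stated in this report; the function names are ours for illustration:

```python
def point_change(before_pct: float, after_pct: float) -> float:
    """Change in the share itself, in percentage points."""
    return after_pct - before_pct


def relative_change(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting share, in percent."""
    return (after_pct - before_pct) / before_pct * 100.0


# (2015 share, 2016 share) of respondents, as reported in this survey
shares = {
    "public cloud": (51, 61),
    "Standalone": (48, 42),
    "YARN": (40, 36),
    "Mesos": (11, 7),
}

changes = {
    name: (point_change(a, b), relative_change(a, b))
    for name, (a, b) in shares.items()
}

for name, (points, relative) in changes.items():
    print(f"{name}: {points:+.0f} points, {relative:+.1f}% relative")
```

Read this way, the 10-point rise in public cloud use and the 20% growth figure describe the same change, and the 13%, 10%, and 36% on-premises decreases are all relative changes.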
10
Investment in fast data analytics has surged,
according to Datanami. As companies
shift investments from batch to
real-time applications, respondents in this
survey show an affinity for building real-time
applications using the Spark Streaming
framework.
Among all the streaming engines, 33% of
respondents said they were heavy users of
Spark Streaming.
51% of respondents CONSIDER APACHE SPARK STREAMING VERY IMPORTANT
35% SOMEWHAT IMPORTANT
14% NOT IMPORTANT
33% of respondents USE APACHE SPARK STREAMING A LOT
11
Respondents indicated that Spark Streaming
is very important for building real-time
streaming applications, recommendation engines, and
fraud detection applications.
Machine Learning has seen an increase in
production usage.
MLlib USE IN PRODUCTION
13% (2015), 18% (2016): 38% growth in ADVANCED ANALYTICS PRODUCTION CASES
% of respondents who use the component in production
Q: WHICH KINDS OF PRODUCTS DOES YOUR ORGANIZATION DEVELOP? Select all that apply.
45% of respondents develop REAL-TIME STREAMING PRODUCTS
40% of respondents develop RECOMMENDATION ENGINE PRODUCTS
29% of respondents develop FRAUD DETECTION / SECURITY PRODUCTS
12
APACHE SPARK’S
GROWTH
CONTINUES
13
The Apache Spark
Community is Growing
This section identifies the key growth areas
that are propelling Spark’s uptake. Both 2015
and 2016 have seen tremendous growth in the
Spark community and in Spark usage across many vertical
industries.
Spark remains the most active open source
project in Big Data. Today, there are over 1000
Spark contributors from 250+ organizations, compared to 600
contributors in 2015. With so many
contributors and organizations investing in Spark’s
future development, it has engaged a global community
of developers. Membership in Apache Spark Meetup
groups continues to flourish, both
nationally and internationally.
CODE CONTRIBUTORS: 600 (2015), 1000 (2016), up 67%
SPARK MEETUP MEMBERS: 66,000 (2015), 225,000 (2016), up 240%
14
COMPANIES REPRESENTED AT SUMMITS: 1144 (2015), 1800 (2016), up 57%
SPARK SUMMIT ATTENDEES: 3912 (2015), 5100 (2016), up 30%
Every year, more users attend Spark Summit, the
largest conference dedicated to the Apache Spark
project. In 2016, attendance grew again, drawing
attendees from a broad range of organizations:
developers, data scientists, and engineers;
business users and analysts; and executive-level
decision makers. A number of notable users
presented how they use Spark at Spark Summit
San Francisco 2016.
NOTABLE SPARK USERS WHO PRESENTED AT SPARK SUMMIT 2016
15
4 RELEASES IN 2015: 1.2, 1.3, 1.4, 1.5
2 MAJOR RELEASES IN 2016: 1.6, 2.0
75% USE SPARK 1.6
18% USE SPARK 2.0
7% OTHER
In just two years, the Spark community has
shipped six releases. When asked which
version of Apache Spark they are using, 75%
responded that they are using Spark 1.6, while 18%
are using Spark 2.0 (respondents could choose
multiple releases, including earlier ones such as 1.3, 1.4, or 1.5).
as of September 2016
16
Spark’s Fastest Growing
Areas from 2015 to 2016
Spark Streaming, in particular, saw
a notable increase in usage from 2015, as did Spark SQL,
MLlib, and Windows as a development environment.
STREAMING USERS IN PRODUCTION: 14% of respondents (2015), 22% of respondents (2016), up 57%
DATAFRAME USERS IN PRODUCTION: 15% of respondents (2015), 38% of respondents (2016), up 153%
SPARK SQL USERS IN PRODUCTION: 24% of respondents (2015), 40% of respondents (2016), up 67%
ADVANCED ANALYTICS USERS (MLlib) IN PRODUCTION: 13% of respondents (2015), 18% of respondents (2016), up 38%
WINDOWS USERS IN DEVELOPMENT: 23% of respondents (2015), 32% of respondents (2016), up 39%
17
Spark Users are Growing
Spark is attractive not only to highly skilled and
technically advanced users. It crosses barriers:
other users, such as business analysts,
increasingly use Spark and develop Spark-based
applications in environments other than Linux.
Since last year, the percentage of Windows users
employing Spark has increased.
DEVELOPMENT ENVIRONMENTS
LINUX / UNIX: 75% (2015), 74% (2016)
WINDOWS: 23% (2015), 32% (2016), up 39% YEAR-OVER-YEAR
MAC OSX: 14% (2015), 22% (2016)
% of respondents who use each development environment (more than one environment could be selected)
18
Spark Users Employ
Multiple Languages
Spark is becoming the key data processing and
computing platform used by a broad range of users.
These users span many vertical industries and use
a variety of programming languages. One reason for
this broad adoption is that Spark is easy to use
and supports familiar programming APIs across
these languages.
Usage of Spark in Python, SQL, and R increased,
while Scala and Java usage decreased. This
indicates that more data analysts are drawn
to Spark from areas other than pure data
engineering, suggesting that Spark usage is
expanding to new and diverse users.
Q: WHICH LANGUAGES DO YOU USE SPARK IN?
SCALA: 71% (2015), 65% (2016)
PYTHON: 58% (2015), 62% (2016)
SQL: 36% (2015), 44% (2016)
JAVA: 31% (2015), 29% (2016)
R: 18% (2015), 20% (2016)
% of respondents who use each language (more than one language could be selected)
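Because respondents could select more than one language, the shares are computed per option against the number of respondents, so they can add up to more than 100% (the 2016 language shares above sum to 220%). A minimal sketch of that tally, assuming each answer is stored as a set of selected options; the four toy responses are invented for illustration and are not survey data:

```python
from collections import Counter

# Hypothetical toy responses, invented for illustration (not survey data).
responses = [
    {"Scala", "Python"},
    {"Python", "SQL"},
    {"Scala", "Java", "SQL"},
    {"Python", "R"},
]


def share_by_option(answers):
    """Percent of respondents who selected each option."""
    counts = Counter(option for answer in answers for option in answer)
    total = len(answers)
    return {option: 100.0 * n / total for option, n in counts.items()}


shares = share_by_option(responses)

for language, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{language}: {pct:.0f}%")
print(f"sum of shares: {sum(shares.values()):.0f}%")
```

The denominator is the number of respondents, not the number of selections, which is why a year-over-year drop for one language does not have to be offset by a rise in another.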
19
Spark Components Used
in Production
Since last year, the use of Spark components in
production has increased, especially Spark
Streaming and advanced analytics with Apache
Spark MLlib (machine learning). This corroborates
the observation in this report about increased
interest among Spark users in building real-time
streaming applications with Spark Streaming,
using multiple components, including MLlib.
Q: WHICH COMPONENTS OF THE APACHE SPARK STACK ARE YOU USING?
DATAFRAMES: 15% (2015), 38% (2016), 153% growth in DATAFRAMES USERS
SQL: 24% (2015), 40% (2016), 67% growth in SQL USERS
STREAMING: 14% (2015), 22% (2016), 57% growth in STREAMING USERS
ADVANCED ANALYTICS (MLlib): 13% (2015), 18% (2016), 38% growth in ADVANCED ANALYTICS USERS
% of respondents who use each component in production (more than one component could be selected)
20
Spark is Used Widely
in Organizations
Spark’s adoption continues to grow across varied
industries because of its unified engine and
its proven performance and versatility, which
enable it to process diverse workloads.
The banking sector saw the highest percentage
change in Spark usage since 2015, followed by the
Health, Medical, Biotech, and Pharmacy verticals.
Q: WHAT INDUSTRY VERTICAL BEST DESCRIBES YOUR ORGANIZATION?
25% SOFTWARE (SAAS, WEB, MOBILE)
18% CONSULTING (IT)
11% BANKING / FINANCE
7% ADVERTISING / MARKETING / PR
6% ECOMMERCE / RETAIL
5% HEALTH / MEDICAL / PHARMACY / BIOTECH
5% CARRIERS / TELECOM
4% EDUCATION
3% PUBLISHING / MEDIA
3% COMPUTERS / HARDWARE
13% OTHER
Percentages rounded to the nearest integer.
BANKING USERS: 6.48% (2015), 10.58% (2016), up 63%
HEALTH / MEDICAL / PHARMACY / BIOTECH USERS: 3.89% (2015), 5.42% (2016), up 39%
CONSULTING (IT) USERS: 13.98% (2015), 18.09% (2016), up 29%
21
Users Solve Complex
Problems
Users are solving complex data problems across
varied industry verticals, as Spark’s unified platform
enables users to build complex solutions using
multiple Spark components for their multiple
data workloads.
Q: WHICH KINDS OF PRODUCTS DOES YOUR ORGANIZATION DEVELOP? Select all that apply.
68% BUSINESS / CUSTOMER INTELLIGENCE
52% DATA WAREHOUSING
45% REAL-TIME / STREAMING SOLUTIONS
40% RECOMMENDATION ENGINES
37% LOG PROCESSING
36% USER-FACING SERVICES
29% FRAUD DETECTION / SECURITY
22
Users Employ Multiple
Components
Because of Spark’s unified engine and its ability
to process multiple workloads within the same
cluster, many Spark users within organizations use
multiple components of Spark for their use cases
and their respective workloads.
Not only are Spark components used separately;
two or more components are often used together in
prototyping and production. This unification
blurs the barriers between data scientists, data
engineers, and data analysts, all of whom use the same
unified compute engine.
74% of Spark users USE TWO OR MORE COMPONENTS
64% of Spark users USE THREE OR MORE COMPONENTS
[Chart: COMPONENTS USED IN PROTOTYPING AND PRODUCTION, covering DATASETS, GRAPHX, MLlib, SPARK SQL, SPARK STREAMING, and DATAFRAMES; values shown: 31%, 14%, 43%, 67%, 43%, 67%]
More than one component could be selected.
23
What Users Consider
Important
Users are drawn to Spark for a number of reasons:
it’s easier to get started quickly because of simple and
consistent APIs; it’s faster because of improvements
in Apache Spark 2.0; and it’s smarter because of
simplified Structured Streaming APIs, allowing users
to build end-to-end continuous applications.
In our 2015 Spark Survey, 91% of users
considered performance the most important
aspect of Apache Spark, along with ease of
programming, real-time streaming, and advanced
analytics. In this year’s survey, Spark users rate
these features as just as important.
At the time of this survey, Apache Spark
2.0 had just been officially released, and
users displayed a keen interest in using
it. Even though most users run Spark 1.6,
the 2016 survey results suggest they had
quickly started using Spark 2.0.
% OF RESPONDENTS WHO CONSIDERED THE FEATURE VERY IMPORTANT
PERFORMANCE: 91%
ADVANCED ANALYTICS: 82%
EASE OF PROGRAMMING: 76%
EASE OF DEPLOYMENT: 69%
REAL-TIME STREAMING: 51%
More than one feature could be selected.
RUN SPARK 1.6: 75%
RUN SPARK 2.0: 18%
24
Top Three Storage
Technologies
A large number of Spark users use storage technologies
other than Apache® Hadoop®, such as NoSQL stores like
Cassandra and MongoDB, as well as
open-source and proprietary SQL data stores.
Q: WHICH OF THESE TECHNOLOGIES DO YOU CURRENTLY USE? Select all that apply.
[Chart: % of respondents who use KEY-VALUE STORES (NoSQL), OPEN-SOURCE SQL DATABASES, and PROPRIETARY SQL DATABASES; values shown: 73%, 58%, 82%]
25
Section Summary
Apache Spark’s growth and adoption continue as users across industries, development
environments, disciplines, and programming languages embrace its ease of use and
programming, its unified compute engine, and its performance in solving complex data
problems at scale. Spark allows multiple components to work on multiple workloads
and access data from multiple data sources. All of these factors make Spark an attractive
choice as a unified compute data platform.
26
APACHE SPARK IN THE
CLOUD IS GROWING
27
Trend: Increase in Public Cloud
Deployments
The rise of cloud computing is rapid, inexorable and
causing a huge upheaval in the tech industry, writes
The Economist. “Gartner estimates that about $205
billion, or 6% of the world’s IT budget of $3.4 trillion,
will be spent on cloud computing in 2016—a number
it expects to grow to $240 billion next year,” according
to another article in The Economist.
This survey reflects this trend, as many respondents
are electing to deploy Spark in the public cloud,
mitigating both cost and infrastructure headaches.
Since 2015, we have seen 20% relative growth in users
deploying Spark in the public cloud: 61% of users
in the 2016 survey said they deployed Spark in the public
cloud, compared to 51% in 2015.
SPARK DEPLOYMENT IN PUBLIC CLOUDS
HAS INCREASED BY 10 PERCENTAGE POINTS SINCE 2015.
2015: 51% of respondents deployed Spark in a public cloud
2016: 61% of respondents deploy Spark in a public cloud
28
Trend: Percentage Decrease
in On-Premises Deployments
Although many Spark users run Spark
on-premises alongside Hadoop and other data
sources, some deployment modes in 2016 have
seen a percentage decrease.	
Q: WHERE DO YOU RUN SPARK? Select all that apply.
STANDALONE: 48% (2015), 42% (2016), a 13% decrease in STANDALONE SPARK DEPLOYMENTS
YARN: 40% (2015), 36% (2016), a 10% decrease in YARN SPARK DEPLOYMENTS
MESOS: 11% (2015), 7% (2016), a 36% decrease in MESOS SPARK DEPLOYMENTS
29
Section Summary
Not only do cloud deployments have lower deployment costs and fewer management headaches,
they also offer proven performance benefits.
Using Apache Spark on 206 EC2 machines, we sorted 100TB of data on disk in
23 minutes. In comparison, the previous world record set by Hadoop MapReduce
used 2100 machines and took 72 minutes. This means that Spark sorted the
same data 3X faster using 10X fewer machines.
REYNOLD XIN
Chief Architect & Co-Founder of Databricks
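The "3X faster using 10X fewer machines" summary in the quote above can be checked directly from the four numbers it cites. A minimal sketch of that arithmetic; the variable names are ours, and the calculation compares only machine counts and wall-clock minutes, ignoring any differences in hardware between the two runs:

```python
# Figures quoted above for sorting 100TB of data on disk
spark = {"machines": 206, "minutes": 23}
hadoop = {"machines": 2100, "minutes": 72}

# How many times faster the Spark run finished
speedup = hadoop["minutes"] / spark["minutes"]

# How many times fewer machines the Spark run used
machine_ratio = hadoop["machines"] / spark["machines"]

print(f"speedup: {speedup:.1f}x")
print(f"machine ratio: {machine_ratio:.1f}x fewer machines")
```

Both ratios round to the figures in the quote: roughly 3x on time and roughly 10x on machine count.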
30
APACHE SPARK STREAMING
AND MACHINE LEARNING
SURGE IN USAGE
31
Apache Spark Streaming
is Growing
Since its release, Spark Streaming has become
one of the most widely used distributed
streaming engines. Interest in developing real-time
applications and advanced analytics is on the rise.
Over half of the survey respondents (51%) consider
streaming very important for developing
real-time streaming solutions, recommendation
engines, and fraud-detection and security solutions.
Q: HOW IMPORTANT IS SPARK STREAMING TO YOUR USE CASE?
VERY IMPORTANT: 51%
SOMEWHAT IMPORTANT: 35%
NOT IMPORTANT: 14%
Q: WHICH KINDS OF PRODUCTS DOES YOUR ORGANIZATION DEVELOP? Select all that apply.
45% of respondents develop REAL-TIME STREAMING PRODUCTS
40% of respondents develop RECOMMENDATION ENGINE PRODUCTS
29% of respondents develop FRAUD DETECTION / SECURITY PRODUCTS
32
Organizations use Spark Streaming along with
Spark’s other multiple components to develop
streaming applications. Both Spark Streaming and
MLlib saw a notable increase in production use.
SPARK STREAMING AND MLlib USE IN PRODUCTION
STREAMING: 14% (2015), 22% (2016), 57% growth in STREAMING PRODUCTION CASES
ADVANCED ANALYTICS (MLlib): 13% (2015), 18% (2016), 38% growth in ADVANCED ANALYTICS PRODUCTION CASES
% of respondents who use the component in production (more than one component could be selected)
33
Apache Spark
Streaming Engine
is the Preferred Choice
Compared to other streaming engines, Spark
is the preferred choice at 33%.
Compared to other Spark components,
Spark Streaming matches MLlib at 71% in use,
from evaluation to production.
In the 2015 Spark Survey, 14% of users said they
used Spark Streaming in production, compared to
22% of users in 2016, a 57% growth
in users running Spark Streaming in production.
Q: WHICH OF THESE TECHNOLOGIES DO YOU CURRENTLY USE A LOT FOR STREAMING AND/OR COMPLEX EVENT PROCESSING CASES? Select all that apply.
APACHE SPARK: 33%
APACHE KAFKA: 29%
APACHE STORM: 6%
KINESIS: 4%
APACHE APEX: 1%
APACHE FLINK: 1%
Note: Respondents were predominantly Spark users.
APACHE SPARK COMPONENT POPULARITY
DATAFRAMES: 89%
SQL: 88%
RDDS: 83%
SPARK STREAMING: 71%
MLlib: 71%
% of respondents who use the component anywhere from evaluation to production (more than one component could be selected)
Q: DO YOU CURRENTLY USE SPARK STREAMING IN PRODUCTION?
14% used it in 2015
22% are using it today
+57% SPARK STREAMING PRODUCTION CASES
34
Section Summary
Spark Streaming is being used for real-time solutions, from evaluation to production, at levels
approaching those of Spark’s other commonly used components. It is the preferred streaming
engine among respondents, and more organizations are building real-time streaming solutions as they
come to consider streaming an important Spark feature.
35
Afterword: Reynold Xin
2015 and 2016 have been exciting years for the adoption and
growth of Apache Spark and its community. Two releases, Spark 1.6
and 2.0, brought major improvements in all the aspects of Spark that
respondents in this survey noted as important. I look forward to
working with the community toward the exciting future ahead for the
Spark platform.
As Spark becomes easier, faster, and smarter, a newer audience outside the predominant
IT and consulting industries is adopting it, as results from
the survey suggest. Performance, ease of use, streaming, and reliability
top the list of most important features. We
released Apache Spark 2.0 around the time of this survey. The performance improvements of
Project Tungsten, started in earlier releases, culminated in Spark 2.0.
In addition, Spark 2.0 delivered unified DataFrame and Dataset APIs and
simplified Structured Streaming APIs. All of these make Spark an attractive
engine for advanced analytics across industry verticals,
used by people in different functional roles to solve complex data problems.
						
Your voice matters. We got an insightful glimpse into the growth and
trends from this year’s survey: who’s using Spark, how they are using it,
what’s important, what new features they use, and what they are using
it for. Just as the feedback from last year’s survey did, these insights will
drive major updates and help shape the future of the Spark platform.
Thank you to everyone who participated in Databricks’ Apache Spark
Survey 2016!
REYNOLD XIN
Chief Architect & Co-Founder of Databricks
@rxin
36
Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™,
a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache
Spark project providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark, and has the largest number of
customers deploying Spark to date. Databricks provides a just-in-time data platform, to simplify data integration, real-time experimentation, and robust deployment of
production applications. Databricks is venture-backed by Andreessen Horowitz and NEA. For more information, contact info@databricks.com.
TRY DATABRICKS FOR FREE
databricks.com/try-databricks
CONTACT US FOR A PERSONALIZED DEMO
databricks.com/contact-databricks
© Databricks 2016. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
37

More Related Content

PDF
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
PPTX
Apache spark sneha challa- google pittsburgh-aug 25th
PDF
Spark is going to replace Apache Hadoop! Know Why?
PDF
Apache Spark Usage in the Open Source Ecosystem
PPTX
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
PDF
Spark SQL | Apache Spark
PDF
Apache Spark 101
PPTX
Frustration-Reduced PySpark: Data engineering with DataFrames
"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016
Apache spark sneha challa- google pittsburgh-aug 25th
Spark is going to replace Apache Hadoop! Know Why?
Apache Spark Usage in the Open Source Ecosystem
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Spark SQL | Apache Spark
Apache Spark 101
Frustration-Reduced PySpark: Data engineering with DataFrames

Viewers also liked (20)

PDF
An introduction To Apache Spark
PDF
PySpark Cassandra - Amsterdam Spark Meetup
PPTX
5 things one must know about spark!
PDF
PySpark Best Practices
PDF
Distributed ML in Apache Spark
PDF
Performance of Spark vs MapReduce
PPTX
Introduction to Apache Spark and MLlib
PPTX
Big data Processing with Apache Spark & Scala
PDF
Machine Learning with Spark MLlib
PPTX
Big Data Trend with Open Platform
PPTX
Online Tweet Sentiment Analysis with Apache Spark
PDF
PySpark in practice slides
PPTX
Programming in Spark using PySpark
PDF
Apache Spark Tutorial
PDF
Machine Learning and GraphX
PDF
Spark DataFrames and ML Pipelines
PDF
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
PDF
Large-Scale Machine Learning with Apache Spark
PDF
MLlib: Spark's Machine Learning Library
PDF
Python and Bigdata - An Introduction to Spark (PySpark)
An introduction To Apache Spark
PySpark Cassandra - Amsterdam Spark Meetup
5 things one must know about spark!
PySpark Best Practices
Distributed ML in Apache Spark
Performance of Spark vs MapReduce
Introduction to Apache Spark and MLlib
Big data Processing with Apache Spark & Scala
Machine Learning with Spark MLlib
Big Data Trend with Open Platform
Online Tweet Sentiment Analysis with Apache Spark
PySpark in practice slides
Programming in Spark using PySpark
Apache Spark Tutorial
Machine Learning and GraphX
Spark DataFrames and ML Pipelines
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Large-Scale Machine Learning with Apache Spark
MLlib: Spark's Machine Learning Library
Python and Bigdata - An Introduction to Spark (PySpark)
Ad

Similar to 2016 spark survey (20)

PDF
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
PPTX
The Evolution of Social & Revolution of Messaging Apps
PDF
IoT Developer Survey 2016
PDF
Spark Summit EU 2015: Matei Zaharia keynote
PPTX
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
PDF
State of API Integration Report 2017
PDF
Don't Migrate to S/4HANA until you've read this research
PDF
Why change? Why Open Source? Why Red Hat? Why now?
PDF
Partner Managed Cloud for SAP S/4HANA
PDF
You Can’t Live Without Open Source - Results from the Open Source 360 Survey
PPTX
apidays LIVE Hong Kong 2021 - The API Trends for 2022 and beyond by Jimmy Tsa...
PPTX
2014 Future of Open Source - 8th Annual Survey results
PDF
State of the Cloud DevOps Trends
PDF
apidays Australia 2023 - 2023 State of the API Report, Jordan Walsh, Postman
PDF
2023 State of the API Report: Key Findings and Trends
PDF
Modern IT Architecture Survey 2016
PDF
Pre-Con Ed: Hack that API—Your Data, Your Way With CA Performance Management
PPTX
The State of API 2020 Webinar – Exploring Trends, Tools & Takeaways to Drive ...
PDF
'Shift-Right' - Rapid Evolution with DesignOps
PPT
OpenStack 2015 Marketing Plan
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
The Evolution of Social & Revolution of Messaging Apps
IoT Developer Survey 2016
Spark Summit EU 2015: Matei Zaharia keynote

2016 spark survey

  • 2. Table of Contents: Introduction (3); Foreword: Matei Zaharia (4); REPORT HIGHLIGHTS (5); APACHE SPARK’S GROWTH CONTINUES (13): The Apache Spark Community is Growing (14), Spark’s Fastest Growing Areas from 2015 to 2016 (17), Spark Users are Growing (18), Spark Users Employ Multiple Languages (19), Spark Components Used in Production (20), Spark is Used Widely in Organizations (21), Users Solve Complex Problems (22), Users Employ Multiple Components (23), What Users Consider Important (24), Top Three Storage Technologies (25), Section Summary (26); APACHE SPARK IN THE CLOUD IS GROWING (27): Trend: Increase in Public Cloud Deployments (28), Trend: Percentage Decrease in On-Premises Deployments (29), Section Summary (30); APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE (31): Apache Spark Streaming is Growing (32), Apache Spark Streaming Engine is the Preferred Choice (34), Section Summary (35); Afterword: Reynold Xin (36); About Databricks (37).
  • 3. SPARK SURVEY 2016 Introduction In July 2016, Databricks conducted an Apache® Spark™ Survey to identify how organizations are using Spark and to highlight growth trends since the Spark Survey 2015. The results in this report reflect answers from 1615 respondents at over 900 distinct organizations, who were predominantly Apache Spark users. 2015 was a year of tremendous growth for Apache Spark, and that growth continues unabated this year, not only in areas like the public cloud but also in the increased use of Spark Streaming and Machine Learning. 2016 also shows Spark’s robust adoption across a variety of organizations, by users in many functional roles who build complex solutions with multiple Spark components. Of the roles represented in the survey, 41% of respondents identified themselves as data engineers, 23% as data scientists, and 21% as architects; of the remainder, 10% came from technical management and 5% from academia.
  • 4. Foreword: Matei Zaharia I’m delighted to share the results of this year’s Databricks Apache Spark Survey. As I noted in the previous Spark Survey 2015, we witnessed a rapid adoption of Spark and precipitous growth of the Spark community. This year, Spark’s growth trajectory and trends continue. In particular, I’m excited to see more Spark deployments in the cloud and more interest in building real-time applications using Spark Streaming with multiple components, such as Machine Learning. Apache Spark 2.0 lays the foundation for Structured Streaming by providing simplified and unified APIs for writing end-to-end streaming applications, called continuous applications, so I anticipate this interest will surge further in the coming months with subsequent releases of Spark. Since its inception, Spark’s core mission has been to make Big Data simple and accessible for everyone, for organizations of all sizes and across all industries. And we have not deviated from that mission. In Apache Spark 2.0, we strove to make Spark easier, faster and smarter. And we remain committed to our vision of simplicity. Seventy-six percent of respondents in this survey indicate ease of programming as one of the most important features of Spark. Spark’s growth continues across various industries, where people in various functional roles are building complex data solutions. It has moved well beyond the early-adopter phase at tech companies and is now mainstream in large data-driven enterprises. Matei Zaharia, Chief Technologist at Databricks, VP of Apache Spark at the Apache Software Foundation, @matei_zaharia
  • 5. REPORT HIGHLIGHTS Top three Apache Spark takeaways: Spark’s growth continues; Spark in the cloud is growing; Spark Streaming and Machine Learning surge in usage.
  • 6. REPORT HIGHLIGHTS This year the growth trend continues in the community. Growth in Apache Spark Meetup membership, a jump in Spark Summit attendees, more code contributors, and a surge in companies represented at the Spark Summit (from several vertical industries) suggest a growing and thriving Spark community. Code contributors: 600 (2015), 1000 (2016), up 67%. Spark Meetup members: 66,000 (2015), 225,000 (2016), up 240%. Companies at Summits: 1144 (2015), 1800 (2016), up 57%. Spark Summit attendees: 3912 (2015), 5100 (2016), up 30%. [Logos: notable Spark users who presented at Spark Summit 2016]
  • 7. REPORT HIGHLIGHTS Asked which Apache Spark components developers use to build complex solutions for their use cases, 74% of respondents said they use two or more components to build different types of products. Number of components used: 74% of respondents use two or more components; 64% of respondents use three or more components. Types of products built, % of respondents who use Spark to create each product (more than one product could be selected): business / customer intelligence 68%; data warehousing 52%; real-time / streaming solutions 45%; recommendation engines 40%; log processing 37%; user-facing services 36%; fraud detection / security 29%.
  • 8. REPORT HIGHLIGHTS In addition to using multiple Apache Spark components, many respondents indicated that they use multiple programming languages in Spark. They are also using multiple components in production, including increased use of Spark Streaming and MLlib. Languages used in Spark year-over-year, % of respondents who use each language (more than one language could be selected): Scala 71% (2015), 65% (2016); Python 58% (2015), 62% (2016); SQL 36% (2015), 44% (2016); Java 31% (2015), 29% (2016); R 18% (2015), 20% (2016). Spark components used in production year-over-year, % of respondents who use each component in production (more than one component could be selected): SQL 24% (2015), 40% (2016); DataFrames 15% (2015), 38% (2016); Streaming 14% (2015), 22% (2016); advanced analytics (MLlib) 13% (2015), 18% (2016).
  • 9. REPORT HIGHLIGHTS Apache Spark’s fastest growing areas in 2016 (components used in production): Streaming users, 14% of respondents (2015), 22% (2016), up 57%; advanced analytics (MLlib) users, 13% (2015), 18% (2016), up 38%; DataFrame users, 15% (2015), 38% (2016), up 153%; Spark SQL users, 24% (2015), 40% (2016), up 67%.
  • 10. REPORT HIGHLIGHTS Apache Spark deployment in public clouds increased by 10 percentage points since 2015. 51% of users in the 2015 Spark Survey said they deployed Apache Spark in the public cloud, compared with 61% of users in 2016, a relative growth of about 20%. While Apache Spark deployments in the public cloud increased in 2016, the percentage of Spark deployments on-premises decreased. For example, 48% of users in the 2015 Spark Survey and 42% in the 2016 survey said they used Standalone cluster managers for their on-premises Spark deployments, a relative decrease of 13%. Similarly, YARN and Mesos show relative decreases of 10% and 36%, respectively. On-premises deployments year-over-year, % of respondents who use each (more than one deployment could be selected): Standalone 48% (2015), 42% (2016); YARN 40% (2015), 36% (2016); Mesos 11% (2015), 7% (2016).
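The deployment figures above involve two kinds of change: a difference in percentage points and a relative (percent) change. As a rough illustration, the Python sketch below recomputes both from the survey percentages quoted on the slide. The function names are hypothetical helpers for this example and are not part of the report.

```python
def point_change(old_pct, new_pct):
    """Absolute change, in percentage points."""
    return new_pct - old_pct


def relative_change(old_pct, new_pct):
    """Relative change, as a percent of the 2015 value."""
    return 100.0 * (new_pct - old_pct) / old_pct


# Survey percentages quoted on the slide: (2015, 2016)
deployments = {
    "public cloud": (51, 61),
    "standalone": (48, 42),
    "YARN": (40, 36),
    "Mesos": (11, 7),
}

for name, (y2015, y2016) in deployments.items():
    print(f"{name}: {point_change(y2015, y2016):+d} points, "
          f"{relative_change(y2015, y2016):+.1f}% relative")
```

Read this way, the public cloud share moves by 10 points, which is roughly a 20% relative increase, and the on-premises decreases come out close to the 13%, 10%, and 36% quoted on the slide.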
  • 11. REPORT HIGHLIGHTS Investments in fast data analytics have surged, according to Datanami. As companies shift investments from batch to real-time applications, respondents in this survey show an affinity for building real-time applications using the Spark Streaming framework. Among all the streaming engines, 33% of respondents said they were heavy users of Spark Streaming. Importance of Apache Spark Streaming to respondents: very important 51%; somewhat important 35%; not important 14%. 33% of respondents use Apache Spark Streaming a lot.
  • 12. REPORT HIGHLIGHTS Respondents indicated that Spark Streaming is very important for building real-time streaming, recommendation engine, and fraud detection applications. Machine Learning has seen an increase in production usage. Q: Which kinds of products does your organization develop? (Select all that apply.) Real-time streaming products 45%; recommendation engine products 40%; fraud detection / security products 29%. MLlib use in production (% of respondents who use the component in production): 13% (2015), 18% (2016), a 38% increase in advanced analytics production cases.
  • 14. APACHE SPARK’S GROWTH CONTINUES The Apache Spark Community is Growing This section identifies key growth areas in all aspects of Spark that are propelling this uptake. Both 2015 and 2016 have seen tremendous growth in the Spark community and in Spark usage across many vertical industries. Spark today remains the most active open source project in Big Data. Today, there are over 1000 Spark contributors, compared to 600 in 2015 from 250+ organizations. With so many contributors and organizations investing in Spark’s future development, it has engaged a community of developers globally. Membership in the Apache Spark Meetup groups continues to flourish, both nationally and internationally. Code contributors: 600 (2015), 1000 (2016), up 67%. Spark Meetup members: 66,000 (2015), 225,000 (2016), up 240%.
  • 15. APACHE SPARK’S GROWTH CONTINUES Every year, more users attend Spark Summit, the largest conference dedicated to the Apache Spark project. In 2016 more attendees came from a broad range of organizations, ranging from developers, data scientists, and engineers to business users, analysts, and executive-level decision makers. A number of notable users presented how they use Spark at the Spark Summit San Francisco 2016. Companies represented at Summits: 1144 (2015), 1800 (2016), up 57%. Spark Summit attendees: 3912 (2015), 5100 (2016), up 30%. [Logos: notable Spark users who presented at Spark Summit 2016]
  • 16. APACHE SPARK’S GROWTH CONTINUES In just two years, the Spark community has shipped six releases: four in 2015 (1.2, 1.3, 1.4, 1.5) and two major releases in 2016 (1.6, 2.0). When asked which version of Apache Spark they are using, 75% of respondents said Spark 1.6 and 18% said Spark 2.0, with 7% selecting other versions (respondents could choose multiple releases, such as 1.3, 1.4, or 1.5). Figures as of September 2016.
  • 17. APACHE SPARK’S GROWTH CONTINUES Spark’s Fastest Growing Areas from 2015 to 2016 Usage of Spark Streaming in particular has increased notably since 2015, as has usage of SQL, MLlib, and Windows. Streaming users in production: 14% of respondents (2015), 22% (2016), up 57%. DataFrame users in production: 15% (2015), 38% (2016), up 153%. Spark SQL users in production: 24% (2015), 40% (2016), up 67%. Advanced analytics (MLlib) users in production: 13% (2015), 18% (2016), up 38%. Windows users in development: 23% (2015), 32% (2016), up 39%.
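The growth percentages on this slide appear to be year-over-year changes relative to the 2015 share of respondents. A minimal Python sketch, assuming that interpretation, recomputes them from the quoted shares and ranks the areas; the dictionary keys and the `growth_pct` helper are illustrative names, not taken from the report.

```python
# (2015 %, 2016 %) of respondents, as quoted on the slide
usage = {
    "Spark Streaming (production)": (14, 22),
    "DataFrames (production)": (15, 38),
    "Spark SQL (production)": (24, 40),
    "MLlib (production)": (13, 18),
    "Windows (development)": (23, 32),
}


def growth_pct(old, new):
    """Year-over-year growth relative to the 2015 share, in percent."""
    return 100.0 * (new - old) / old


growth = {name: growth_pct(*pair) for name, pair in usage.items()}

# Rank areas from fastest to slowest growing
for name, pct in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct:.0f}%")
```

Rounded to whole percents, these values agree with the growth figures printed on the slide, which supports reading them as relative rather than percentage-point changes.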
  • 18. APACHE SPARK’S GROWTH CONTINUES Spark Users are Growing Spark is attractive not only to highly skilled and technically advanced users. It crosses barriers: other users, such as business analysts, increasingly use Spark and develop Spark-based applications in environments other than Linux. Since last year, the percentage of Windows users employing Spark has increased. Development environments, % of respondents who use each (more than one environment could be selected): Linux / Unix 75% (2015), 74% (2016); Windows 23% (2015), 32% (2016), a 39% year-over-year increase in Windows users; Mac OSX 14% (2015), 22% (2016).
  • 19. APACHE SPARK’S GROWTH CONTINUES Spark Users Employ Multiple Languages Spark is becoming the key data processing and computing platform for a broad range of users. These users span many vertical industries and use a variety of programming languages. One reason for this broad adoption is that Spark is easy to use and supports familiar programming APIs across these languages. Usage of Spark in Python, SQL, and R increased, while Scala and Java usage decreased. This indicates that more data analysts are drawn to Spark from areas other than pure data engineering, suggesting that Spark usage is expanding to new and diverse users. Q: Which languages do you use Spark in? % of respondents who use each language (more than one language could be selected): Scala 71% (2015), 65% (2016); Python 58% (2015), 62% (2016); SQL 36% (2015), 44% (2016); Java 31% (2015), 29% (2016); R 18% (2015), 20% (2016).
  • 20. APACHE SPARK’S GROWTH CONTINUES Spark Components Used in Production Since last year, the use of Spark components in production has increased, especially Spark Streaming and advanced analytics with Apache Spark MLlib (machine learning). This corroborates the observation in this report that Spark users are increasingly interested in building real-time streaming applications with Spark Streaming, using multiple components, including MLlib. Q: Which components of the Apache Spark stack are you using? % of respondents who use each component in production (more than one component could be selected): DataFrames 15% (2015), 38% (2016); SQL 24% (2015), 40% (2016); Streaming 14% (2015), 22% (2016); advanced analytics (MLlib) 13% (2015), 18% (2016). Growth callouts: DataFrames users 153%; SQL users 67%; Streaming users 57%; advanced analytics users 11% as printed on this slide (the 13% to 18% change is reported as 38% elsewhere in the report).
  • 21. APACHE SPARK’S GROWTH CONTINUES Spark is Used Widely in Organizations Spark’s adoption continues to grow across varied industries because of its unified engine and its proven performance and versatility, which enable it to process diverse workloads. The banking sector saw the highest percentage change in Spark usage since 2015, followed by the Health, Medical, Biotech and Pharmacy verticals. Q: What industry vertical best describes your organization? Software (SaaS, web, mobile) 25%; consulting (IT) 18%; banking / finance 11%; advertising / marketing / PR 7%; ecommerce / retail 6%; health / medical / pharmacy / biotech 5%; carriers / telecom 5%; education 4%; publishing / media 3%; computers / hardware 3%; other 13%. Percentages rounded to the nearest integer. Growth by vertical: banking users 6.48% (2015), 10.58% (2016), up 63%; health / medical / pharmacy / biotech users 3.89% (2015), 5.42% (2016), up 39%; consulting (IT) users 13.98% (2015), 18.09% (2016), up 29%.
  • 22. APACHE SPARK’S GROWTH CONTINUES Users Solve Complex Problems Users are solving complex data problems across varied industry verticals, as Spark’s unified platform enables them to build complex solutions using multiple Spark components for their multiple data workloads. Q: Which kinds of products does your organization develop? (Select all that apply.) Business / customer intelligence 68%; data warehousing 52%; real-time / streaming solutions 45%; recommendation engines 40%; log processing 37%; user-facing services 36%; fraud detection / security 29%.
  • 23. APACHE SPARK’S GROWTH CONTINUES Users Employ Multiple Components Because of Spark’s unified engine and its ability to process multiple workloads within the same cluster, many Spark users within organizations use multiple components of Spark for their use cases and their respective workloads. Spark components are not only used separately; two or more components are often used together in prototyping and production. This unification blurs the barriers between data scientists, data engineers, and data analysts, all of whom use the same unified compute engine. 74% of Spark users use two or more components; 64% of Spark users use three or more components. [Chart: components used in prototyping and production, covering Datasets, GraphX, MLlib, Spark SQL, Spark Streaming, and DataFrames; values shown on the slide: 31%, 14%, 43%, 67%, 43%, 67%. More than one component could be selected.]
  • 24. APACHE SPARK’S GROWTH CONTINUES What Users Consider Important Users are drawn to Spark for a number of reasons: it’s easier to get started quickly because of simple and consistent APIs; it’s faster because of improvements in Apache Spark 2.0; and it’s smarter because of simplified Structured Streaming APIs, which allow users to build end-to-end continuous applications. According to our 2015 Spark Survey, 91% of users consider performance the most important aspect of Apache Spark, along with ease of programming, real-time streaming and advanced analytics. In this year’s survey, Spark users rate these as equally important. At the time of this survey, Apache Spark 2.0 had just been officially released, and users displayed a keen interest in using it. Even though most users run Spark 1.6, the 2016 survey results suggest many had quickly started using Spark 2.0. % of respondents who considered the feature very important (more than one feature could be selected): performance 91%; advanced analytics 82%; ease of programming 76%; ease of deployment 69%; real-time streaming 51%. Spark versions: 75% run Spark 1.6; 18% run Spark 2.0.
  • 25. APACHE SPARK’S GROWTH CONTINUES Top Three Storage Technologies A large number of Spark users use storage technologies other than Apache® Hadoop®, such as NoSQL stores like Cassandra and MongoDB, as well as open-source and proprietary SQL data stores. Q: Which of these technologies do you currently use? (Select all that apply.) [Chart: % of respondents who use key-value stores (NoSQL), open-source SQL databases, and proprietary SQL databases; values shown on the slide: 73%, 58%, 82%.]
  • 26. APACHE SPARK’S GROWTH CONTINUES Section Summary Apache Spark’s growth and adoption continue as users across industries, development environments, disciplines, and programming languages embrace its ease of use and programming, its unified compute engine, and its performance in solving complex data problems at scale. Spark allows multiple components to work on multiple workloads and access data from multiple data sources. All of these factors make Spark an attractive choice as a unified compute data platform.
  • 27. APACHE SPARK IN THE CLOUD IS GROWING
  • 28. APACHE SPARK IN THE CLOUD IS GROWING Trend: Increase in Public Cloud Deployments The rise of cloud computing is rapid, inexorable and causing a huge upheaval in the tech industry, writes The Economist. “Gartner estimates that about $205 billion, or 6% of the world’s IT budget of $3.4 trillion, will be spent on cloud computing in 2016—a number it expects to grow to $240 billion next year,” according to another article in The Economist. This survey reflects this trend, as many respondents are electing to deploy Spark in the public cloud, mitigating both cost and infrastructure headaches. 61% of users in the 2016 survey said they deployed Spark in the public cloud, compared to 51% in 2015. That is an increase of 10 percentage points, or about 20% relative growth in users deploying Spark in the public cloud.
  • 29. APACHE SPARK IN THE CLOUD IS GROWING Trend: Percentage Decrease in On-Premises Deployments Although many Spark users run Spark on-premises alongside Hadoop and other data sources, some deployment modes saw a percentage decrease in 2016. Q: Where do you run Spark? (Select all that apply.) Standalone 48% (2015), 42% (2016), a 13% decrease in Standalone Spark deployments; YARN 40% (2015), 36% (2016), a 10% decrease in YARN Spark deployments; Mesos 11% (2015), 7% (2016), a 36% decrease in Mesos Spark deployments.
  • 30. APACHE SPARK IN THE CLOUD IS GROWING Section Summary Cloud deployments not only have lower deployment costs and fewer management headaches; they also offer proven performance benefits. “Using Apache Spark on 206 EC2 machines, we sorted 100TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines.” Reynold Xin, Chief Architect & Co-Founder of Databricks
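The "3X faster using 10X fewer machines" claim follows from the four numbers quoted in the sort comparison. The Python sketch below works through that arithmetic. The `SortRun` class and the machine-minutes metric are assumptions introduced for illustration; the report itself states only the machine counts and run times, and this sketch ignores any differences in hardware between the two runs.

```python
from dataclasses import dataclass


@dataclass
class SortRun:
    machines: int
    minutes: float

    @property
    def machine_minutes(self):
        # Derived metric (not in the report): machines multiplied by wall-clock minutes
        return self.machines * self.minutes


# Figures quoted in the section summary for the 100TB sort
spark = SortRun(machines=206, minutes=23)
mapreduce = SortRun(machines=2100, minutes=72)

speedup = mapreduce.minutes / spark.minutes          # how many times faster
machine_ratio = mapreduce.machines / spark.machines  # how many times fewer machines
resource_ratio = mapreduce.machine_minutes / spark.machine_minutes

print(f"speedup: {speedup:.1f}x")
print(f"machine ratio: {machine_ratio:.1f}x")
print(f"machine-minutes ratio: {resource_ratio:.1f}x")
```

The first two ratios come out a little above 3 and 10, which matches the rounded figures in the quote. The machine-minutes ratio is only a rough way to combine the two and should not be read as a result from the report.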
  • 31. APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE
  • 32. APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE Apache Spark Streaming is Growing Since its release, Spark Streaming has become one of the most widely used distributed streaming engines. Interest in developing real-time applications and advanced analytics is on the rise. Over half of the survey respondents indicate that streaming is very important for developing real-time streaming, recommendation engine, and fraud-detection and security solutions. Q: How important is Spark Streaming to your use case? Very important 51%; somewhat important 35%; not important 14%. Q: Which kinds of products does your organization develop? (Select all that apply.) Real-time streaming products 45%; recommendation engine products 40%; fraud detection / security products 29%.
  • 33. APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE Organizations use Spark Streaming along with Spark’s other components to develop streaming applications. Both Spark Streaming and MLlib saw a notable increase in production use. Spark Streaming and MLlib use in production, % of respondents who use the component in production (more than one component could be selected): Streaming 14% (2015), 22% (2016), a 57% increase in streaming production cases; advanced analytics (MLlib) 13% (2015), 18% (2016), a 38% increase in advanced analytics production cases.
  • 34. APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE Apache Spark Streaming Engine is the Preferred Choice Compared to other streaming engines, Spark is the preferred choice at 33%. Among Spark components, Spark Streaming matches MLlib at 71% in use, from evaluation to production. In the 2015 Spark Survey, 14% of users said they used Spark Streaming in production, compared to 22% of users in 2016, a 57% growth in users running Spark Streaming in production. Q: Which of these technologies do you currently use a lot for streaming and/or complex event processing cases? (Select all that apply.) Apache Spark 33%; Apache Kafka 29%; Apache Storm 6%; Kinesis 4%; Apache Apex 1%; Apache Flink 1%. Apache Spark component popularity, % of respondents who use the component anywhere from evaluation to production (more than one component could be selected): DataFrames 89%; SQL 88%; RDDs 83%; Spark Streaming 71%; MLlib 71%. Note: Respondents were predominantly Spark users.
  • 35. APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE Section Summary Spark Streaming is being used for real-time solutions, from evaluation to production, and its usage is approaching that of Spark’s other commonly used components. Spark Streaming is the preferred streaming engine among respondents, and more organizations are building real-time streaming solutions as they come to consider streaming an important Spark feature.
  • 36. Afterword: Reynold Xin 2015 and 2016 have been exciting years for the adoption and growth of Apache Spark and its community. Two releases, Spark 1.6 and 2.0, brought major improvements in all the aspects of Spark that respondents in this survey noted as important. I continue to look forward to the exciting future ahead for the Spark platform, and to working with the community on it. As Spark becomes easier, faster, and smarter, a newer audience outside the predominant IT and consulting industry is adopting it, as results from the survey suggest. Performance, ease of use, streaming, and reliability top the list of most important features. At the time of this survey, we released Apache Spark 2.0. Ongoing performance improvements with Project Tungsten started in earlier releases and culminated in Spark 2.0. In addition, Spark 2.0 delivered unified DataFrames and Datasets APIs and simplified Structured Streaming APIs. All of these make Spark an attractive engine for performing advanced analytics across industry verticals and for solving complex data problems, for users in different functional roles. Your voice matters. This year’s survey gave us an insightful glimpse into growth and trends: who’s using Spark, how they are using it, what’s important, what new features they use, and what they are using it for. Just as the feedback from last year’s survey did, these insights will drive major updates and help shape the future of the Spark platform. Thank you to everyone who participated in Databricks’ Apache Spark Survey 2016! Reynold Xin, Chief Architect & Co-Founder of Databricks, @rxin
  • 37. Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark, and has the largest number of customers deploying Spark to date. Databricks provides a just-in-time data platform, to simplify data integration, real-time experimentation, and robust deployment of production applications. Databricks is venture-backed by Andreessen Horowitz and NEA. For more information, contact info@databricks.com. TRY DATABRICKS FOR FREE databricks.com/try-databricks CONTACT US FOR A PERSONALIZED DEMO databricks.com/contact-databricks © Databricks 2016. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. 37