Why spark by Stratio - v.1.0
WHY SPARK?
RDD-Based Matrices
“In Spark, we explicitly wanted to come up with a
single programming model that is very general
that covers these interactive [SQL] use cases, the
streaming ones, the more complex applications.
I think the thing that really sets Spark apart
compared to some other systems that tackle these
is that it can actually do all of them. You only have
to learn one system and you can easily make an
application that combines these. It’s only one
thing to manage, and I think that’s what gets
people interested in it.”
Databricks co-founder and CTO Matei Zaharia (source)
WHY SPARK?
               June 2013   June 2014
contributors   68          255
companies      17          50
lines of code  63000       175000
Spark: one of the most active projects at Apache
Spark is the most active project in the Hadoop ecosystem
[Data source: Git logs; chart courtesy of Matei Zaharia]
Spark, real world use cases, by Datanami
Spark role in the Big Data ecosystem, by Databricks
WHY SPARK?
Since Spark was open sourced, it has generated rapid interest, with
over 200 contributors from 50+ organizations collaborating on
the project.
Open source contributors Cloudera, Databricks, IBM, Intel, and MapR
announced last July that they are joining efforts to broaden support for
Apache Spark (Spark), while simultaneously standardizing it as the
framework of choice by bringing popular tools from the MapReduce world
to this new engine.
Spark has quickly become a standard in many Hadoop distributions, with
rapid customer adoption across a variety of use cases, ranging from
machine learning to stream processing workloads.
WHY SPARK?
Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
Spark has an advanced DAG (directed acyclic graph) execution engine that
supports acyclic data flow and in-memory computing.
All these benchmarks are public and available on the Apache Spark website.
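Why in-memory computing helps an iterative job can be sketched with a toy model. This is plain Python, not Spark's API: `LazyDataset`, `run_iterations` and `load_from_disk` are hypothetical names made up for illustration. The sketch only counts how often the source is read when the same data is scanned several times, with and without keeping the computed records in memory.

```python
# Toy model only: plain Python, not Spark's API. All names are hypothetical.

class LazyDataset:
    """A dataset defined by a loader and a chain of per-record functions."""

    def __init__(self, loader, steps=()):
        self._loader = loader        # callable that reads the source, returns a list
        self._steps = tuple(steps)   # transformations recorded but not yet run
        self._use_cache = False      # whether to keep results in memory
        self._cached = None          # materialized records, once computed

    def map(self, fn):
        # Record the step; nothing is computed yet (lazy evaluation).
        return LazyDataset(self._loader, self._steps + (fn,))

    def cache(self):
        # Ask for the computed records to be kept in memory after first use.
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        records = self._loader()
        for fn in self._steps:
            records = [fn(r) for r in records]
        if self._use_cache:
            self._cached = records
        return list(records)


def run_iterations(dataset, n):
    """Simulate an iterative job that scans the same data n times."""
    return [sum(dataset.collect()) for _ in range(n)]


reads = {"count": 0}

def load_from_disk():
    # Stand-in for an expensive read from disk.
    reads["count"] += 1
    return [1, 2, 3, 4]

squared = LazyDataset(load_from_disk).map(lambda x: x * x)

run_iterations(squared, 5)
reads_without_cache = reads["count"]   # one source read per iteration

reads["count"] = 0
run_iterations(squared.cache(), 5)
reads_with_cache = reads["count"]      # a single source read, then memory
```

In this toy, five iterations cost five source reads without caching and one read with it. The speedups quoted on the slide come from the published Spark benchmarks, not from anything this sketch measures.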
WHY SPARK?
As a general iterative computing framework,
Spark is the core of many other products, such
as Spark SQL, Spark Streaming, MLlib and
GraphX.
Every improvement made to Spark core
immediately benefits the other
modules.
Spark SQL
Spark Streaming
MLlib (machine learning)
GraphX (graph)
RDDs, as a general data abstraction, allow Spark
to talk to many file systems and databases.
In fact, any data source that supports the Hadoop InputFormat
interface can be easily integrated into Spark.
WHY SPARK?
One of the challenges organizations face when
adopting Hadoop is a shortage of developers who have
experience building Hadoop applications.
Our professional services organization has helped
dozens of companies with the development and
deployment of Hadoop applications, and our training
department has trained countless engineers.
Organizations are hungry for solutions that make it
easier to develop Hadoop applications while increasing
developer productivity, and Spark fits this bill. Spark
jobs can require as little as 1/5th the number of
lines of code.
by Tomer Shiran, VP of Product Management, MapR
MapR Integrates the Complete Spark Stack
WHY SPARK?
An RDD remembers the sequence of operations that created it from the original fault-tolerant input data
Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
Data lost due to worker failure can be recomputed from the input data
Recovers from faults/stragglers within 1 sec
Spark Streaming, at Strata Conference, February 2013
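The lineage idea on this slide can be illustrated with a toy model. Again this is plain Python and not Spark's API; `LineageDataset` and `lose_partition` are hypothetical names. Each dataset keeps the input partitions (assumed to be durable) together with the recorded operations, so a partition dropped from memory can be rebuilt by replaying those operations on its input.

```python
# Toy model only: plain Python, not Spark's API. All names are hypothetical.

class LineageDataset:
    """Partitions plus the recorded operations (lineage) that produce them."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions   # assumed durable, fault-tolerant input
        self._lineage = tuple(lineage)     # ordered per-record functions
        self._computed = {}                # partition index -> in-memory result

    def map(self, fn):
        # Extend the lineage; the input partitions are shared, not copied.
        return LineageDataset(self._source, self._lineage + (fn,))

    def _compute(self, index):
        # Replay every recorded operation on one input partition.
        records = self._source[index]
        for fn in self._lineage:
            records = [fn(r) for r in records]
        return records

    def partition(self, index):
        # Recompute from the input if the in-memory copy is missing.
        if index not in self._computed:
            self._computed[index] = self._compute(index)
        return self._computed[index]

    def lose_partition(self, index):
        # Simulate a worker failure that drops one in-memory partition.
        self._computed.pop(index, None)


source = [[1, 2], [3, 4], [5, 6]]
doubled_plus_one = LineageDataset(source).map(lambda x: x * 2).map(lambda x: x + 1)

before = [doubled_plus_one.partition(i) for i in range(3)]
doubled_plus_one.lose_partition(1)            # "worker failure"
recovered = doubled_plus_one.partition(1)     # rebuilt from lineage, equals before[1]
```

Only the lost partition is recomputed; the other two stay in memory untouched.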
WHY SPARK?
Hadoop does a pretty terrible job with machine learning. Spark is good with logistic regression, and
that can help with anything that involves a binary decision: Is this message spam? Should I show this
ad to this user?
— Reynold Xin (source)
Spark is well suited to iterative computing (machine learning
algorithms) and interactive analytics.
Most ML algorithms run on the same data set iteratively, and
in MapReduce there was no easy way to communicate a
shared state from iteration to iteration.
MLlib was added to the Spark ecosystem and is now one of
the most active modules.
In addition, SparkR is on its way, and Mahout is working to
incorporate the benefits of Spark while exploring other high
performance back-ends as well.
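The access pattern described above (the same data set scanned on every iteration, with a small shared state carried between iterations) can be seen in a minimal logistic regression trained by batch gradient descent. This is a plain-Python sketch, not MLlib code; the data, the function names and the hyperparameters are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(points, iterations=100, learning_rate=0.1):
    """Batch gradient descent; every iteration rescans the same `points`."""
    n_features = len(points[0][0])
    weights = [0.0] * n_features   # shared state carried across iterations
    for _ in range(iterations):
        gradient = [0.0] * n_features
        for features, label in points:
            prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
            error = prediction - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        # Update the shared state from the averaged gradient.
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights

def predict(weights, features):
    # Binary decision: 1 if the estimated probability is at least 0.5.
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0

# Hypothetical data: ((bias term, centered score), label)
training_points = [
    ((1.0, -3.0), 0), ((1.0, -2.0), 0), ((1.0, -1.0), 0),
    ((1.0, 1.0), 1), ((1.0, 2.0), 1), ((1.0, 3.0), 1),
]

weights = train_logistic_regression(training_points)
predictions = [predict(weights, features) for features, _ in training_points]
```

The inner loop reads `training_points` once per iteration, which is why keeping that data in memory between iterations matters, and `weights` is the state that has to survive from one iteration to the next.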
WHY SPARK?
Spark is written in Scala and also offers APIs for Java and Python, and it
provides a set of predefined APIs for building new
programs.
Code with Spark on your own machine and deploy to a cluster.
WHY SPARK?
Spark can run on hardware clusters managed by Apache Mesos. The advantages include:
• dynamic partitioning between Spark and other frameworks
• scalable partitioning between multiple instances of Spark
If you decide to run Spark on YARN, you can decide on an application-by-application basis whether to
run in YARN client mode or cluster mode. When you run Spark in client mode, the driver process runs
locally; in cluster mode, it runs remotely on an ApplicationMaster.
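The per-application choice between the two YARN modes comes down to one option on the `spark-submit` command line. The sketch below is a small hypothetical Python helper that only builds the argument list. It assumes the `--master yarn` and `--deploy-mode` syntax of later Spark releases; older releases also accepted `yarn-client` and `yarn-cluster` as master values. The application file name is a placeholder.

```python
def spark_submit_command(app_path, deploy_mode):
    """Build a spark-submit argument list for YARN (hypothetical helper).

    deploy_mode "client": the driver runs locally, in the submitting process.
    deploy_mode "cluster": the driver runs remotely, on an ApplicationMaster.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["spark-submit", "--master", "yarn",
            "--deploy-mode", deploy_mode, app_path]


# "my_app.py" is a placeholder application file.
client_cmd = spark_submit_command("my_app.py", "client")
cluster_cmd = spark_submit_command("my_app.py", "cluster")
```

Client mode is the usual choice for interactive work, since the driver's output appears locally. Cluster mode suits jobs that should keep running after the submitting machine disconnects.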