APACHE SPARK OVERVIEW 
tech talk @ ferret 
Andrii Gakhov
• Apache Spark™ is a fast and general engine for 
large-scale data processing. 
• Latest release: Spark 1.1.1 (Nov 26, 2014) 
• spark.apache.org 
• Originally developed in 2009 at UC Berkeley’s 
AMPLab and open-sourced in 2010. Spark is now 
supported by Databricks.
APACHE SPARK 
[Architecture diagram: the Apache Spark core with Spark SQL, MLlib, GraphX, 
and Streaming on top; runs standalone with local storage, or on Mesos, YARN, 
or EC2; reads from S3 and HDFS across the cluster nodes.]
RDD 
• Spark’s primary abstraction is the Resilient 
Distributed Dataset (RDD): an immutable, 
distributed collection of data. 
textFile = sc.textFile("api.log") 
anotherFile = sc.textFile("hdfs://var/log/api.log") 
• Collections of objects that can be stored in memory 
or on disk across the cluster 
• Parallel functional transformations (map, filter, …) 
• Automatically rebuilt on failure
RDD 
• RDDs have actions, which return values, and 
transformations, which return pointers to new RDDs. 
• Actions: 
• reduce collect count countByKey take saveAsTextFile 
takeSample … 
• Transformations: 
• map filter flatMap distinct sample join union intersection 
reduceByKey groupByKey sortByKey … 
errors = logFile.filter(lambda line: line.startswith("ERROR")) 
print errors.count()
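The transformation/action split is easy to miss: transformations build a recipe, and nothing runs until an action forces it. As an analogy only (plain Python, not Spark), generators are lazy the same way transformations are, and consuming them plays the role of an action:

```python
# Analogy only: a generator expression is a lazy "transformation";
# nothing is computed until the "action" (sum) consumes it.
log_lines = ["ERROR disk full", "INFO ok", "ERROR timeout"]

# "Transformation": just a recipe, no work done yet.
errors = (line for line in log_lines if line.startswith("ERROR"))

# "Action": iterating the generator triggers the actual filtering.
error_count = sum(1 for _ in errors)
print(error_count)  # 2
```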
PERSISTENCE 
• You can control the persistence of an RDD across operations 
(MEMORY_ONLY, MEMORY_AND_DISK, …) 
• When you persist an RDD in memory, each node stores 
any partitions of it that it computes in memory and 
reuses them in other actions on that dataset (or datasets 
derived from it) 
• This allows future actions to be much faster (often by 
more than 10x). 
errors.cache() 
endpoint_errors = errors.filter( 
    lambda line: "/test/endpoint" in line) 
endpoint_errors.count()
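A toy call counter (plain Python, not Spark) makes the effect of persistence visible; the hypothetical `expensive_filter` stands in for a costly lineage that each action would otherwise recompute:

```python
# Analogy only: persisting an RDD keeps computed results around so later
# actions reuse them instead of recomputing the lineage from scratch.
compute_calls = 0

def expensive_filter(lines):
    """Stand-in for a transformation that is expensive to recompute."""
    global compute_calls
    compute_calls += 1
    return [l for l in lines if l.startswith("ERROR")]

lines = ["ERROR a", "INFO b", "ERROR c"]

# Unpersisted: every "action" recomputes the whole pipeline.
n1 = len(expensive_filter(lines))
n2 = len(expensive_filter(lines))

# "Persisted": compute once, then both actions reuse the cached result.
cached = expensive_filter(lines)
n3 = len(cached)
n4 = len(cached)

print(compute_calls)  # 3: two uncached runs + one cached run
```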
[Diagram: Hadoop MapReduce writes to HDFS between every iteration, while 
Apache Spark keeps intermediate data in memory across iterations, touching 
HDFS only at the start and end.]
INTERACTIVE DEMO 
STRATA+HADOOP WORD EXAMPLE 
http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html
SPARK SQL 
TRANSFORM RDD WITH SQL
SCHEMA RDD 
• Spark SQL allows relational queries expressed in SQL, 
HiveQL, or Scala to be executed using Spark. 
• At the core of this component is a new type of RDD - 
SchemaRDD. 
• SchemaRDDs are composed of Row objects, along with a 
schema that describes the data types of each column in the row. 
• A SchemaRDD is similar to a table in a traditional relational 
database. 
• A SchemaRDD can be created from an existing RDD, a Parquet 
file, a JSON dataset, or by running HiveQL against data stored in 
Apache Hive.
SCHEMA RDD 
• To work with Spark SQL you need an SQLContext 
(or HiveContext) 
from pyspark.sql import SQLContext, Row 
sqlCtx = SQLContext(sc) 
records = sc.textFile("customers.csv") 
customers = records.map(lambda line: line.split(",")) \
    .map(lambda r: Row(name=r[0], age=int(r[1]))) 
customersTable = sqlCtx.inferSchema(customers) 
customersTable.registerAsTable("customers")
SCHEMA RDD 
[Diagram: a plain RDD holds opaque User objects, while a SchemaRDD holds 
rows with named, typed columns (Name, Age, Phone).] 
• Transformations over an RDD are just functional 
transformations on partitioned collections of objects 
• Transformations over a SchemaRDD are 
declarative transformations on partitioned 
collections of tuples
SPARK SQL 
• A SchemaRDD can be used as a regular RDD at 
the same time. 
seniors = sqlCtx.sql(""" 
    SELECT * FROM customers WHERE age >= 70""") 
print seniors.count() 
print seniors.map(lambda r: "Name: " + r.name).take(10)
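As an analogy only, the same "seniors" query can be run against an in-memory SQLite table from the standard library; a SchemaRDD behaves like such a relational table, just partitioned across a cluster (table contents here are made up):

```python
import sqlite3

# Analogy only: SchemaRDD ~ a relational table you can query declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Alice", 72), ("Bob", 45), ("Carol", 81)])

seniors = conn.execute(
    "SELECT name, age FROM customers WHERE age >= 70").fetchall()
print(len(seniors))  # 2
```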
MLLIB 
Distributed Machine Learning
MACHINE LEARNING LIBRARY 
• MLlib uses the linear algebra package Breeze, 
which depends on netlib-java, and jblas 
• MLlib in Python requires NumPy version 1.4+ 
• MLlib is under active development 
• Many API changes every release 
• Not all algorithms are fully functional
MACHINE LEARNING LIBRARY 
• Basic statistics 
• Classification and regression 
• linear models (SVMs, logistic regression, linear 
regression) 
• decision trees 
• naive Bayes 
• Collaborative filtering 
• alternating least squares (ALS) 
• Clustering 
• k-means
MACHINE LEARNING LIBRARY 
• Dimensionality reduction 
• singular value decomposition (SVD) 
• principal component analysis (PCA) 
• Feature extraction and transformation 
• Optimization 
• stochastic gradient descent 
• limited-memory BFGS (L-BFGS)
MACHINE LEARNING LIBRARY 
• LinearRegression with stochastic gradient descent (SGD) 
example on Spark: 
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD 

def parsePoint(line): 
    values = [float(x) for x in line.replace(',', ' ').split(' ')] 
    return LabeledPoint(values[0], values[1:]) 

parsedData = data.map(parsePoint) 
model = LinearRegressionWithSGD.train(parsedData) 
valuesAndPreds = parsedData.map( 
    lambda p: (p.label, model.predict(p.features))) 
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2) \
    .reduce(lambda x, y: x + y) / valuesAndPreds.count()
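The MSE step at the end of the pipeline can be checked with plain Python map/reduce over a small hypothetical list of (label, prediction) pairs, mirroring the RDD version line for line:

```python
from functools import reduce

# Analogy only: the slide's MSE computation over a plain list instead of
# an RDD of (label, prediction) pairs. The data points are made up.
values_and_preds = [(3.0, 2.5), (0.0, 0.5), (7.0, 6.0)]

squared_errors = map(lambda vp: (vp[0] - vp[1]) ** 2, values_and_preds)
mse = reduce(lambda x, y: x + y, squared_errors) / len(values_and_preds)
print(mse)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```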
SPARK STREAMING 
Fault-tolerant stream processing
SPARK STREAMING 
• Spark Streaming enables scalable, high-throughput, 
fault-tolerant stream processing of live data streams 
• Spark Streaming provides a high-level abstraction 
called discretized stream or DStream, which 
represents a continuous stream of data 
• Internally, a DStream is represented as a sequence 
of RDDs.
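The micro-batching idea, chopping a continuous stream into one collection per time interval, can be sketched in plain Python (timestamps and events here are made up):

```python
# Analogy only: a DStream groups a live stream into per-interval batches;
# each batch plays the role of one RDD in the sequence.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]

batches = {}
for ts, ev in events:
    batches.setdefault(int(ts), []).append(ev)  # one batch per second

print(batches)  # {0: ['a', 'b'], 1: ['c'], 2: ['d', 'e']}
```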
SPARK STREAMING 
• Example of processing Twitter Stream with Spark 
Streaming: 
import org.apache.spark.streaming._ 
import org.apache.spark.streaming.twitter._ 
… 
val ssc = new StreamingContext(sc, Seconds(1)) 
val tweets = TwitterUtils.createStream(ssc, auth) 
val hashTags = tweets.flatMap(status=>getTags(status)) 
hashTags.saveAsHadoopFiles("hdfs://...")
SPARK STREAMING 
• Any operation applied on a DStream translates to 
operations on the underlying RDDs. 
[Diagram: a DStream as a sequence of RDDs, one per time interval: 
RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4]
SPARK STREAMING 
• Spark Streaming also provides windowed 
computations, which allow you to apply 
transformations over a sliding window of data
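A sliding window can be mimicked with `collections.deque` (plain Python, not the Spark API); here a hypothetical window of 3 batches sliding by 1 batch, loosely like a windowed count:

```python
from collections import deque

# Analogy only: sum events over a sliding window of the last 3 batches.
batch_counts = [2, 5, 1, 3, 4]   # made-up events-per-batch figures
window = deque(maxlen=3)         # deque drops the oldest batch automatically
windowed_totals = []
for c in batch_counts:
    window.append(c)
    windowed_totals.append(sum(window))

print(windowed_totals)  # [2, 7, 8, 9, 8]
```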
CONCLUSIONS
SPEED 
• Run programs up to 100x faster than Hadoop 
MapReduce in memory, or 10x faster on disk. 
[Chart: logistic regression running time in Hadoop vs. Spark] 
• Spark won the 2014 Daytona GraySort contest 
(sortbenchmark.org) with 4.27 TB/min 
(in 2013 Hadoop won with 1.42 TB/min)
EASE OF USE 
• Supports out of the box: 
• Java 
• Scala 
• Python 
• You can use it interactively from the Scala and 
Python shells
GENERALITY 
• SQL with SparkSQL 
• Machine Learning with MLlib 
• Graphs computation with GraphX 
• Streaming processing with Spark Streaming
RUNS EVERYWHERE 
• Spark can run on 
• Hadoop (YARN) 
• Mesos 
• standalone 
• in the cloud 
• Spark can read from 
• S3 
• HDFS 
• HBase 
• Cassandra 
• any Hadoop data source.
Thank you. 
• Credits: 
• http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014 
• http://spark.apache.org 
• http://www.databricks.com 
• http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html
