Real-Time Data Processing with
Spark Streaming
Brandon O’Brien
Oct 26th, 2016
Spark Streaming: Intro
1. Intro
2. Demo + high-level walkthrough
3. Spark in detail
4. Detailed demo walkthrough and/or workshop
Spark Streaming
Spark experience level?
Select one:
• Beginner
• Intermediate
• Expert
Spark Streaming: Demo
DEMO
Spark Streaming: Demo Info
• Data Source:
  • Data Producer Thread
  • Redis
• Data Consumer:
  • Spark as Stream Consumer
  • Redis Publish
• Dashboard:
  • Node.js/Redis Integration
  • Socket.io Publish
  • AngularJS + JavaScript
Spark Streaming: Spark in detail
SPARK IN DETAIL
Spark Streaming: Concepts
Application:
• Driver program
• RDD
  • Partition
    • Elements
• DStream
  • InputReceiver
• 1 JVM for driver program
• 1 JVM per executor
Cluster:
• Master
• Executors
• Resources
  • Cores
  • Gigs RAM
• Cluster Types:
  • Standalone
  • Mesos
  • YARN
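The cluster resources listed above (cores and RAM per executor) are typically requested through the application's configuration. The following is a minimal sketch, assuming Spark's standard `spark.executor.memory`, `spark.executor.cores`, and `spark.cores.max` properties; the object name, the helper `buildConf`, and every value are illustrative assumptions and do not come from the slides.

```scala
import org.apache.spark.SparkConf

object ClusterResourcesSketch {
  // Builds a configuration that requests executor resources.
  // All values are illustrative assumptions, not settings from the talk.
  def buildConf(appName: String, master: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .setMaster(master)                  // e.g. "spark://host:7077", "yarn", or "local[2]"
      .set("spark.executor.memory", "2g") // RAM for each executor JVM
      .set("spark.executor.cores", "2")   // cores for each executor JVM
      .set("spark.cores.max", "4")        // total cores for the app (standalone and Mesos)

  def main(args: Array[String]): Unit = {
    val conf = buildConf("resources-sketch", "local[2]")
    // Print the effective settings, one key=value pair per line.
    println(conf.toDebugString)
  }
}
```

The master URL is what selects the cluster type, so the same application code can usually be pointed at Standalone, Mesos, or YARN by changing only that value.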
Spark Streaming: Lazy execution
import org.apache.spark.{SparkConf, SparkContext}

// Allocate resources on the cluster (appName and master are defined elsewhere)
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)

// Lazy definition of logical processing (transformations)
val textFile = sc.textFile("README.md")
  .filter(line => line.length > 10)

// foreachPartition() is an action, so it triggers execution
textFile.foreachPartition(partition => {
  partition.foreach(line => {
    println(line)
  })
})
• Use rdd.persist() when multiple actions are called on the same RDD; otherwise each action recomputes the RDD from its source
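The persist() advice can be sketched as follows. This is a hedged example, not the demo's code: it assumes a local master (`local[2]`), replaces README.md with a few in-memory lines so it runs without a file, and uses a hypothetical helper named `countTwice`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Runs two actions against one persisted RDD and returns both counts.
  def countTwice(sc: SparkContext): (Long, Long) = {
    // In-memory sample lines stand in for README.md (an assumption for runnability).
    val lines = sc.parallelize(Seq(
      "short",
      "Apache Spark is a cluster computing engine",
      "a considerably longer line of plain text"
    ))

    // Transformation plus persist(): still lazy, nothing has run yet.
    val longLines = lines
      .filter(line => line.length > 10)
      .persist(StorageLevel.MEMORY_ONLY)

    // First action: computes the filter and caches the resulting partitions.
    val total = longLines.count()
    // Second action: can reuse the cached partitions instead of recomputing them.
    val mentioningSpark = longLines.filter(_.contains("Spark")).count()

    // Release the cached partitions once they are no longer needed.
    longLines.unpersist()
    (total, mentioningSpark)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(countTwice(sc)) finally sc.stop()
  }
}
```

Without the persist() call the result is the same, but the filter would run once per action.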
Spark Streaming: Execution Env
• Distributed data, distributed code
• RDD partitions are distributed across executors
• Actions trigger execution; actions such as collect() return results to the driver program
• Code is executed on either the driver or executors
• Be careful of function closures!
// Function arguments to transformations are executed on executors
val textFile = sc.textFile("README.md")
  .filter(line => line.length > 10)

// collect() is an action: it triggers execution and returns all elements to the driver,
// so this foreach runs on the driver. foreachPartition(), by contrast, runs on executors.
textFile.collect().foreach(line => {
  println(line)
})
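The warning about function closures can be illustrated with a driver-side variable that a closure captures. This is a sketch under stated assumptions: a local master, Spark 2.x (for `longAccumulator`), and hypothetical names (`ClosureSketch`, `countWithAccumulator`). The closure is serialized and shipped with each task, so tasks update their own copy of `counter`, and the driver's value should not be relied on.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureSketch {
  // Counts elements with an accumulator, which Spark merges back on the driver.
  def countWithAccumulator(sc: SparkContext): Long = {
    val lines = sc.parallelize(Seq("one", "two", "three", "four"), numSlices = 2)
    val seen = sc.longAccumulator("lines seen")
    // foreach is an action; each task adds to the accumulator on its executor.
    lines.foreach(_ => seen.add(1))
    seen.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("closure-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("one", "two", "three", "four"), numSlices = 2)

      // Pitfall: `counter` lives on the driver, but the closure below is
      // serialized and sent to the tasks, which increment their own copies.
      var counter = 0
      lines.foreach(_ => counter += 1)
      println(s"driver-side counter (unreliable): $counter")

      // Safer alternative: an accumulator is designed for this pattern.
      println(s"accumulator value: ${countWithAccumulator(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

The same reasoning applies to any mutable object referenced inside a transformation or action: if the result is needed on the driver, return it through an action or an accumulator.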
Spark Streaming: Parallelism
• RDD partitions are processed in parallel
• Elements in a single partition are processed serially
• You control the number of partitions in an RDD
• If you need to guarantee any particular ordering of processing, use
groupByKey() to force all elements with the same key into the same
partition; sort the values within each group if their order matters
• Be careful of shuffles
val textFile = sc.textFile("README.md")
val singlePartitionRDD = textFile.repartition(1)

// shopResultsEnriched and getPartitionKey() are not shown on this slide
val linesByKey = shopResultsEnriched
  .map(line => (getPartitionKey(line), line))
  .groupByKey()
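The partitioning points above can be tried end to end with the sketch below. It is an illustration under assumptions: a local master, in-memory sample records, and a hypothetical `getPartitionKey` that takes the first comma-separated field (the slide does not show the real one). Because groupByKey() shuffles data and does not guarantee the order of values inside a group, the sketch sorts each group explicitly.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Hypothetical key extractor: everything before the first comma.
  def getPartitionKey(line: String): String = line.takeWhile(_ != ',')

  // Groups records by key and sorts each group so the order is deterministic.
  def orderedByKey(sc: SparkContext): Map[String, Seq[String]] = {
    val events = sc.parallelize(Seq("a,3", "b,1", "a,1", "b,2", "a,2"), numSlices = 4)

    events
      .map(line => (getPartitionKey(line), line))
      .groupByKey()               // shuffle: all values for a key land in one partition
      .mapValues(_.toSeq.sorted)  // explicit ordering inside each group
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val events = sc.parallelize(Seq("a,3", "b,1", "a,1", "b,2", "a,2"), numSlices = 4)
      println(s"partitions before: ${events.getNumPartitions}")
      // repartition() also shuffles; one partition means fully serial processing.
      println(s"partitions after: ${events.repartition(1).getNumPartitions}")
      orderedByKey(sc).foreach { case (key, values) =>
        println(s"$key -> ${values.mkString(" ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

When only an aggregate per key is needed, reduceByKey() usually moves less data across the network than groupByKey(), which is one way to reduce shuffle cost.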
Spark Streaming: DStreams
• Receiver Types
  • Kafka (Receiver + Direct)
  • Flume
  • Kinesis
  • TCP Socket
  • Custom (Ex: redis.receiver.RedisReceiver.scala)
• Note: Kafka receiver will consume an entire core (no context switch)
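As a small illustration of a receiver-based DStream, here is a sketch using the TCP socket receiver from the list above. It is not the demo's Redis receiver: the host, port, batch interval, and the helper `tokenize` are assumptions. The master is set to `local[2]` because a receiver occupies one core for as long as it runs, so at least one more is needed to process the batches.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketStreamSketch {
  // Splits a line into non-empty, whitespace-separated words.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Two local threads: one for the receiver, one for batch processing.
    val conf = new SparkConf().setAppName("socket-stream-sketch").setMaster("local[2]")
    // Each batch covers 5 seconds of received data (an arbitrary choice).
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: a text server on localhost:9999, e.g. started with `nc -lk 9999`.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Per-batch word count; transformations on a DStream are applied to each batch RDD.
    val counts = lines
      .flatMap(tokenize)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // print() is an output operation: without one, nothing is executed.
    counts.print()

    ssc.start()            // start the receiver and the batch scheduler
    ssc.awaitTermination() // block until the stream is stopped
  }
}
```

With `local[1]` the receiver would take the only available thread and no batches would be processed, which is the practical consequence of the note about a receiver consuming an entire core.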

Real Time Data Processing With Spark Streaming, Node.js and Redis with Visualization

Editor's Notes

  • #3: Streaming data, streaming customer behavior, up to thousands per second, Kafka cluster, new apps produce/consume streams, Spark higher level than Storm.