Real-Time User Attribute Estimation with Spark Streaming
SAEKI Yoshiyasu / @laclefyoshi
<ysaeki@r.recruit.co.jp>
•
• Spark Streaming
•
•
• Spark Streaming Tips
•
2
: / SAEKI Yoshiyasu
:
IT
: Web 4 9
R&D
Hadoop, Kafka, Storm, Spark, Druid
: RICOH Theta ( ) + Google Cardboard
3
Spark Streaming
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
4
5
•
• =
•
•
http://www.recruit.jp/company/about/structure.html
6
•
• ≒ …
•
• !
OS etc.
7
1. Web 

(JavaScript)
2. fluentd Kafka
8
: fluentd → Kafka
• fluent-plugin-kafka
• https://github.com/htgc/fluent-plugin-kafka
• output type = kafka_buffered (on file)
• Kafka 0.8.2.2
• 0.9.0
• ACL
9
10
Suro
• Netflix
• https://github.com/Netflix/suro
• : Kafka Consumer API Thrift API
• :
• HDFS
• AWS S3
• Kafka Producer
• Elasticsearch
•
11
LinkedIn
Gobblin
Hadoop
•
• HDFS
• MLlib 

• Streaming linear regression (Regression)
• Streaming k-means (Clustering)
•
12
Spark Streaming
13
Kafka
• Direct Approach (>= Spark 1.3)
•
• Exactly-once
• Kafka Simple Consumer API
Direct Approach
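The exactly-once property of the Direct Approach rests on each batch being defined by explicit offset ranges. The following is a minimal sketch in plain Scala (no Spark or Kafka dependency) of that idea. Every name in it (DirectApproachSketch, OffsetRange, Log, nextRanges, read) is hypothetical and only loosely modeled on the concept; it is not the Spark API.

```scala
// Sketch only: an in-memory stand-in for Kafka, used to illustrate offset ranges.
object DirectApproachSketch {
  // Half-open range [fromOffset, untilOffset) inside one partition (hypothetical type).
  final case class OffsetRange(partition: Int, fromOffset: Long, untilOffset: Long)

  // Stand-in for a topic: partition id -> ordered messages.
  type Log = Map[Int, Vector[String]]

  // Plan the next batch: start at the last processed offset, stop at the current log end.
  def nextRanges(log: Log, processed: Map[Int, Long]): Seq[OffsetRange] =
    log.toSeq.sortBy(_._1).map { case (partition, messages) =>
      OffsetRange(partition, processed.getOrElse(partition, 0L), messages.size.toLong)
    }

  // Reading a range is deterministic, so re-running a failed batch sees the same records.
  def read(log: Log, range: OffsetRange): Seq[String] =
    log(range.partition).slice(range.fromOffset.toInt, range.untilOffset.toInt)
}
```

Because a batch is fully described by its ranges, a retried batch re-reads exactly the same messages. End-to-end exactly-once still depends on the output side being idempotent or transactional.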
14
Spark Streaming 1
15
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
[Figure: a DStream as a sequence of RDDs (RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4)]
Spark Streaming 2
16
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
Micro-batch
17
[Figure labels: 1 micro-batch (Cookie); Window-based micro-batch; 1 micro-batch, 1 micro-batch]
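The difference between a single micro-batch and a window-based micro-batch can be sketched in plain Scala, without Spark. The names here (WindowSketch, Batch, windows, countPerCookie) are hypothetical, and the per-cookie count is only an assumed example of an aggregation.

```scala
// Sketch only: models micro-batches as plain collections to show how windows are formed.
object WindowSketch {
  // One micro-batch: the (cookie, event) pairs that arrived during one batch interval.
  type Batch = Seq[(String, String)]

  // Merge `windowLength` consecutive micro-batches into one window,
  // moving forward by `slide` micro-batches each time.
  def windows(batches: Seq[Batch], windowLength: Int, slide: Int): Seq[Batch] =
    batches.sliding(windowLength, slide).map(_.flatten).toList

  // Count events per cookie inside a single window.
  def countPerCookie(window: Batch): Map[String, Int] =
    window.groupBy(_._1).map { case (cookie, events) => cookie -> events.size }
}
```

With a window length of 2 and a slide of 1, every micro-batch contributes to two consecutive windows, so a cookie's events are seen together even when they arrive in different batch intervals.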
18
Micro-batch
• RDD HBase
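import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text
import org.apache.spark.rdd.PairRDDFunctions

// Write every micro-batch RDD to HBase through the new Hadoop OutputFormat API.
dstream.foreachRDD { rdd =>
  // createHbaseConfiguration() is the author's own helper (not shown on the slide).
  val hbaseConf = createHbaseConfiguration()
  val jobConf = new Configuration(hbaseConf)
  jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Text].getName)
  jobConf.set("mapreduce.outputformat.class",
    classOf[TableOutputFormat[Text]].getName)
  new PairRDDFunctions(rdd.map(hbaseConvert)).saveAsNewAPIHadoopDataset(jobConf)
}

// RDD[(String, Map[K,V])] → RDD[(String, Put)]
// The row key is t._1; each map entry becomes one column in the "seg" family.
def hbaseConvert(t: (String, Map[String, String])): (String, Put) = {
  val p = new Put(Bytes.toBytes(t._1))
  t._2.foreach { case (qualifier, value) =>
    p.addColumn(Bytes.toBytes("seg"),
      Bytes.toBytes(qualifier), Bytes.toBytes(value))
  }
  (t._1, p)
}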
19
0.5 1
20
Spark Streaming :
• DStream RDD
• Spark 

Spark Streaming
21
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
Spark Streaming :
• Fault Tolerance
• Micro-batch
• YARN
• YARN Dynamic Resource Allocation
•
22
Spark Streaming :
• : → 

RDD → RDD DStream → DStream
• 1Micro-batch
23
// RDD → RDD: build a small input RDD by hand
val input: RDD[String] = sparkContext.makeRDD(Seq("a", "b", "c"))
// DStream → DStream: put the RDD in a queue and expose the queue as a DStream
val queue = scala.collection.mutable.Queue(input)
val dstream: DStream[String] =
  sparkStreamingContext.queueStream(queue)
Spark Streaming :
• spark-testing-base
• https://github.com/holdenk/spark-testing-base
class JsonElementCountTest extends StreamingSuiteBase {
  test("simple") {
    // Each inner list is the content of one micro-batch.
    val input = List(List("aa"), List("bb"))
    val expected = List(List("AA"), List("BB"))
    testOperation[String, String](
      input, converterMethod _, expected, useSet = true)
  }
}
24
Spark Streaming :
• Window-based micro-batch
•
• o.a.spark.streaming.util.ManualClock

• private class Scala
• http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
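The idea behind testing window-based micro-batches with a manual clock can be sketched in plain Scala. This is not Spark's internal ManualClock; Clock, ManualClock, and Batcher below are hypothetical stand-ins that only show why a test-controlled clock makes time-dependent behavior deterministic.

```scala
// Sketch only: a clock that moves when the test says so, plus a toy batcher driven by it.
object ManualClockSketch {
  // Hypothetical clock abstraction.
  trait Clock { def now(): Long }

  // Time advances only through explicit calls to advance().
  final class ManualClock(private var time: Long = 0L) extends Clock {
    def now(): Long = time
    def advance(millis: Long): Unit = { time += millis }
  }

  // Toy batcher: closes a micro-batch each time a full batch interval has elapsed.
  final class Batcher(clock: Clock, batchMillis: Long) {
    private var batchStart = clock.now()
    private var current = Vector.empty[String]
    private var closed = Vector.empty[Vector[String]]

    def add(event: String): Unit = { roll(); current :+= event }

    def closedBatches: Vector[Vector[String]] = { roll(); closed }

    // Close every batch interval that has fully elapsed, including empty ones.
    private def roll(): Unit =
      while (clock.now() - batchStart >= batchMillis) {
        closed :+= current
        current = Vector.empty
        batchStart += batchMillis
      }
  }
}
```

A test advances the clock by exactly one batch interval and then checks which batches were closed, instead of sleeping and hoping the timing works out.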
25
Spark Streaming :
• Scala Java
•
• Spark Streaming Kafka HBase Scala
• Java
26
// api/java/JavaRDD.scala
object JavaRDD {
  implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] =
    new JavaRDD[T](rdd)
  implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
}
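The pattern in JavaRDD.scala, implicit conversions defined in a companion object, can be tried without Spark. In this minimal sketch ScalaBox and JavaBox are hypothetical stand-ins for RDD and JavaRDD, and javaCount and scalaSize are made-up callers. It only illustrates how the compiler applies the conversions automatically.

```scala
import scala.language.implicitConversions

// Sketch only: hypothetical classes mirroring the RDD / JavaRDD wrapper pattern.
object ImplicitWrapperSketch {
  // Stand-in for the Scala-side class.
  final class ScalaBox[T](val items: List[T])

  // Stand-in for the Java-friendly wrapper around it.
  final class JavaBox[T](val box: ScalaBox[T]) {
    def count(): Long = box.items.size.toLong
  }

  // Conversions live in the wrapper's companion, so they are found without an import.
  object JavaBox {
    implicit def fromBox[T](box: ScalaBox[T]): JavaBox[T] = new JavaBox[T](box)
    implicit def toBox[T](jbox: JavaBox[T]): ScalaBox[T] = jbox.box
  }

  // A caller that expects the wrapper type.
  def javaCount(jbox: JavaBox[String]): Long = jbox.count()

  // A caller that expects the Scala type.
  def scalaSize(box: ScalaBox[String]): Int = box.items.size
}
```

Passing a ScalaBox to javaCount, or a JavaBox to scalaSize, compiles because the compiler searches the JavaBox companion for a matching conversion.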
27
•
•
• =
• Spark Streaming
• MLlib
• GraphX

Real-Time User Attribute Estimation with Spark Streaming