Spark vs Flink
Rumble in the (Big Data) Jungle
München, 2016-04-20
Konstantin Knauf, Michael Pisula
Background
The Big Data Ecosystem
Apache Top-Level Projects over Time
(timeline chart spanning 2008-2015)
The New Guard
                     Spark                            Flink
Origin               Berkeley University              TU Berlin
Apache Incubator     2013                             04/2014
Apache Top-Level     02/2014                          01/2015
Company              databricks                       data Artisans
Supported languages  Scala, Java, Python, R           Java, Scala, Python
Implemented in       Scala                            Java
Cluster              Stand-Alone, Mesos, EC2, YARN    Stand-Alone, Mesos, EC2, YARN
Teaser               "Lightning-fast cluster          "Scalable Batch and Stream
                      computing"                       Data Processing"
The Challenge
Real-Time Analysis of a Superhero Fight Club
Fight (Stream of events)
  hitter: Int
  hittee: Int
  hitpoints: Int

Segment (Batch)
  id: Int
  name: String
  segment: String

Detail (Batch)
  name: String
  gender: Int
  birthYear: Int
  noOfAppearances: Int

Hero (Segment joined with Detail)
  id: Int
  name: String
  segment: String
  gender: Int
  birthYear: Int
  noOfAppearances: Int
The Setup
(Architecture diagram: a Data Generator provides the Segment and Detail data and publishes Avro-encoded Fight events to a Kafka cluster; on an AWS cluster, batch processing joins the static data into Heroes, which stream processing then combines with the Kafka stream.)
Round 1
Setting up
Dependencies
compile "org.apache.flink:flink-java:1.0.0"
compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"
//For Local Execution from IDE
compile "org.apache.flink:flink-clients_2.11:1.0.0"
Skeleton
//Batch (DataSet API)
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//Stream (DataStream API)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//Processing Logic
//For Streaming
env.execute();
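To see the skeleton in action, here is a minimal, hypothetical DataStream job (not from the talk, names are illustrative) that runs locally from the IDE with the dependencies above:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinimalStreamingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // tiny in-memory stream instead of Kafka, just to exercise the API
        env.fromElements(100, 250, 75)
           .filter(hitPoints -> hitPoints > 100)
           .print();
        // nothing happens until execute() is called
        env.execute("minimal-streaming-job");
    }
}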
Dependencies
compile 'org.apache.spark:spark-core_2.10:1.5.0'
compile 'org.apache.spark:spark-streaming_2.10:1.5.0'
Skeleton Batch
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
// Batch
JavaSparkContext sparkContext = new JavaSparkContext(conf);
// Stream
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(1));
// Processing Logic
jssc.start(); // For Streaming
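And a minimal, hypothetical Spark batch counterpart in local mode (again not from the talk):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MinimalBatchJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("minimal-batch-job").setMaster("local[2]");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        // tiny in-memory RDD instead of S3 input, just to exercise the API
        long humans = sparkContext.parallelize(Arrays.asList("Human", "Robot", "Human"))
                .filter(segment -> segment.equals("Human"))
                .count();
        System.out.println("Human records: " + humans);
        sparkContext.stop();
    }
}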
First Impressions
Practically no boilerplate
Easy to get started and play around
Runs in the IDE
Hadoop MapReduce is much harder to get into
Round 2
Static Data Analysis
Combine both static data parts
Read the csv file and transform it
JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
.map(line -> line.split(","))
.filter(array -> array.length == 3)
.mapToPair((String[] parts) -> {
int id = Integer.parseInt(parts[0]);
String name = parts[1], segment = parts[2];
return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
});
Join with detail data, keep only the human heroes and write the output
segmentTable.join(detailTable)
.mapValues(tuple -> {
SegmentTableRecord s = tuple._1();
DetailTableRecord d = tuple._2();
return new Hero(s.getId(), s.getName(), s.getSegment(),
d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
})
.map(tuple -> tuple._2())
.filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
.saveAsTextFile("s3://...");
Loading Files from S3 into POJO
DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
.ignoreInvalidLines()
.pojoType(SegmentTableRecord.class, "id", "name", "segment");
Join and Filter
DataSet<Hero> humanHeros = segmentTable.join(detailTable)
.where("name")
.equalTo("name")
.with((s, d) -> new Hero(s.id, s.name, s.segment,
d.gender, d.birthYear, d.noOfAppearances))
.filter(hero -> hero.segment.equals("Human"));
Write back to S3
humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE,
h -> h.toCsv());
Performance
Terasort1: Flink ca. 66% of runtime
Terasort2: Flink ca. 68% of runtime
HashJoin: Flink ca. 32% of runtime
(Iterative Processes: Flink ca. 50% of runtime, ca. 7% with
Delta-Iterations)
2nd Round Points
Generally similar abstraction and feature set
Flink has a nicer syntax, more sugar
Spark is pretty bare-bones
Flink is faster
Round 3
Simple Real Time Analysis
Total Hitpoints over Last Minute
Configuring Environment for EventTime
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
ExecutionConfig config = env.getConfig();
config.setAutoWatermarkInterval(500);
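The FightEventTimestampExtractor(6000) used in the processing logic below is not shown on the slides. A rough, hypothetical sketch, written against Flink's AssignerWithPeriodicWatermarks interface rather than the older assignTimestamps()/TimestampExtractor API used on the slides, and assuming FightEvent exposes a getTimestamp() accessor:

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class FightEventTimestampAssigner
        implements AssignerWithPeriodicWatermarks<FightEvent> {

    private final long maxOutOfOrdernessMillis;
    private long currentMaxTimestamp;

    public FightEventTimestampAssigner(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMillis;
    }

    @Override
    public long extractTimestamp(FightEvent event, long previousElementTimestamp) {
        long ts = event.getTimestamp(); // assumption: the event carries its event time
        currentMaxTimestamp = Math.max(currentMaxTimestamp, ts);
        return ts;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // the watermark trails the highest timestamp seen by the allowed lag
        return new Watermark(currentMaxTimestamp - maxOutOfOrdernessMillis);
    }
}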
Creating Stream from Kafka
Properties properties = new Properties();
properties.put("bootstrap.servers", KAFKA_BROKERS);
properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
properties.put("group.id", KAFKA_GROUP_ID);
DataStreamSource<FightEvent> hitStream =
env.addSource(new FlinkKafkaConsumer08<>("FightEventTopic",
new FightEventDeserializer(),
properties));
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
.timeWindowAll(Time.of(60, TimeUnit.SECONDS),
Time.of(10, TimeUnit.SECONDS))
.apply(new SumAllWindowFunction<FightEvent>() {
@Override
public long getSummand(FightEvent fightEvent) {
return fightEvent.getHitPoints();
}
})
.writeAsCsv("s3://...");
Example Output
3> (1448130670000,1448130730000,290789)
4> (1448130680000,1448130740000,289395)
5> (1448130690000,1448130750000,291768)
6> (1448130700000,1448130760000,292634)
7> (1448130710000,1448130770000,293869)
8> (1448130720000,1448130780000,293356)
1> (1448130730000,1448130790000,293054)
2> (1448130740000,1448130800000,294209)
Create Context and get Avro Stream from Kafka
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(1));
HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "xxx:11211");
kafkaParams.put("group.id", "spark");
JavaPairInputDStream<String, FightEvent> kafkaStream =
KafkaUtils.createDirectStream(jssc, String.class, FightEvent.class,
StringDecoder.class, AvroDecoder.class, kafkaParams, topicsSet);
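The AvroDecoder handed to createDirectStream is custom code that is not shown on the slides. A hypothetical sketch, assuming FightEvent is an Avro-generated specific record and using the Kafka 0.8 Decoder interface that this KafkaUtils variant expects:

import java.io.IOException;

import kafka.serializer.Decoder;
import kafka.utils.VerifiableProperties;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public class AvroDecoder implements Decoder<FightEvent> {

    private final SpecificDatumReader<FightEvent> reader =
            new SpecificDatumReader<>(FightEvent.class);

    // Kafka instantiates decoders reflectively via this constructor
    public AvroDecoder(VerifiableProperties properties) {
    }

    @Override
    public FightEvent fromBytes(byte[] bytes) {
        try {
            return reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
        } catch (IOException e) {
            throw new RuntimeException("Could not deserialize FightEvent", e);
        }
    }
}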
Analyze number of hit points over a sliding window
kafkaStream.map(tuple -> tuple._2().getHitPoints())
.reduceByWindow((hit1, hit2) -> hit1 + hit2,
Durations.seconds(60), Durations.seconds(10))
.foreachRDD((rdd, time) -> {
rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
return null;
});
Output
20:19:32 Hitpoints in the last minute [80802]
20:19:42 Hitpoints in the last minute [101019]
20:19:52 Hitpoints in the last minute [141012]
20:20:02 Hitpoints in the last minute [184759]
20:20:12 Hitpoints in the last minute [215802]
3rd Round Points
Flink supports event time windows
Kafka and Avro worked seamlessly in both
Spark uses micro-batches, no real stream
Both have at-least-once delivery guarantees
Exactly-once depends a lot on sink/source
Round 4
Connecting Static Data with Real
Time Data
Total Hitpoints over Last Minute Per Gender
Read static data using objectFile and map genders
JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> genderLookup = staticRdd.mapToPair(user -> {
int genderIndicator = user.getGender();
String gender;
switch (genderIndicator) {
case 1: gender = "MALE"; break;
case 2: gender = "FEMALE"; break;
default: gender = "OTHER"; break;
}
return new Tuple2<>(user.getId(), gender);
});
Analyze number of hit points per hitter over a sliding window
JavaPairDStream<String, Long> hitpointWindowedStream = kafkaStream
.mapToPair(tuple -> {
FightEvent fight = tuple._2();
return new Tuple2<>(fight.getHitterId(), fight.getHitPoints());
})
.reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2,
Durations.seconds(60),
Durations.seconds(10));
Join with static data to find gender for each hitter
hitpointWindowedStream.foreachRDD((rdd, time) -> {
JavaPairRDD<String, Long> hpg = rdd.leftOuterJoin(genderLookup)
.mapToPair(joinedTuple -> {
Optional<String> maybeGender = joinedTuple._2()._2();
Long hitpoints = joinedTuple._2()._1();
return new Tuple2<>(maybeGender.or("UNKNOWN"), hitpoints);
})
.reduceByKey((hit1, hit2) -> hit1 + hit2);
hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
LOGGER.info("Hitpoints per gender {}", hpg.take(5));
return null;
});
Output
20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
20:31:14 Hitpoints [(FEMALE,65543), (OTHER,813), (MALE,116416)]
20:31:24 Hitpoints [(FEMALE,67507), (OTHER,813), (MALE,123750)]
Loading Static Data in Every Map
public FightEventEnricher(String bucket, String keyPrefix) {
this.bucket = bucket;
this.keyPrefix = keyPrefix;
}
@Override
public void open(Configuration parameters) {
populateHeroMapFromS3(bucket, keyPrefix);
}
@Override
public EnrichedFightEvent map(FightEvent event) throws Exception {
return new EnrichedFightEvent(event,
idToHero.get(event.getHitterId()),
idToHero.get(event.getHitteeId()));
}
private void populateHeroMapFromS3(String bucket, String keyPrefix) {
// Omitted
}
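For context, the methods above presumably belong to a Flink RichMapFunction, so that open() loads the hero lookup once per task before map() is ever called. A hypothetical sketch of the complete class (integer hero ids and the project's FightEvent/Hero/EnrichedFightEvent types assumed):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class FightEventEnricher extends RichMapFunction<FightEvent, EnrichedFightEvent> {

    private final String bucket;
    private final String keyPrefix;
    private final Map<Integer, Hero> idToHero = new HashMap<>(); // assuming integer hero ids

    public FightEventEnricher(String bucket, String keyPrefix) {
        this.bucket = bucket;
        this.keyPrefix = keyPrefix;
    }

    @Override
    public void open(Configuration parameters) {
        // runs once per parallel task before any map() call
        populateHeroMapFromS3(bucket, keyPrefix);
    }

    @Override
    public EnrichedFightEvent map(FightEvent event) throws Exception {
        // pure in-memory lookup on the hot path
        return new EnrichedFightEvent(event,
                idToHero.get(event.getHitterId()),
                idToHero.get(event.getHitteeId()));
    }

    private void populateHeroMapFromS3(String bucket, String keyPrefix) {
        // omitted on the slide: read the Heroes output from S3 and fill idToHero
    }
}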
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
.map(new FightEventEnricher("s3_bucket", "output/heros"))
.filter(value -> value.getHittingHero() != null)
.keyBy(enrichedFightEvent ->
enrichedFightEvent.getHittingHero().getGender())
.timeWindow(Time.of(60, TimeUnit.SECONDS),
Time.of(10, TimeUnit.SECONDS))
.apply(new SumWindowFunction<EnrichedFightEvent, Integer>() {
@Override
public long getSummand(EnrichedFightEvent value) {
return value.getFightEvent()
.getHitPoints();
}
})
Example Output
2> (1448191350000,1448191410000,1,28478)
3> (1448191350000,1448191410000,2,264650)
2> (1448191360000,1448191420000,1,28290)
3> (1448191360000,1448191420000,2,263521)
2> (1448191370000,1448191430000,1,29327)
3> (1448191370000,1448191430000,2,265526)
4th Round Points
Spark makes combining batch and streaming easier
Windowing by key works well in both
Java API of Spark can be annoying
Round 5
More Advanced Real Time
Analysis
Best Hitter over Last Minute Per Gender
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
.map(new FightEventEnricher("s3_bucket", "output/heros"))
.filter(value -> value.getHittingHero() != null)
.keyBy(fightEvent -> fightEvent.getHittingHero().getName())
.timeWindow(Time.of(60, TimeUnit.SECONDS),
Time.of(10, TimeUnit.SECONDS))
.apply(new SumWindowFunction<EnrichedFightEvent, String>() {
@Override
public long getSummand(EnrichedFightEvent value) {
return value.getFightEvent().getHitPoints();
}
})
.assignTimestamps(new AscendingTimestampExtractor<...>() {
@Override
public long extractAscendingTimestamp(Tuple4<...> tuple, long l) {
return tuple.f0;
}
})
.timeWindowAll(Time.of(10, TimeUnit.SECONDS))
.maxBy(3)
.print();
Example Output
1> (1448200070000,1448200130000,Tengu,546)
2> (1448200080000,1448200140000,Louis XIV,621)
3> (1448200090000,1448200150000,Louis XIV,561)
4> (1448200100000,1448200160000,Louis XIV,552)
5> (1448200110000,1448200170000,Phil Dexter,620)
6> (1448200120000,1448200180000,Phil Dexter,552)
7> (1448200130000,1448200190000,Kalamity,648)
8> (1448200140000,1448200200000,Jakita Wagner,656)
1> (1448200150000,1448200210000,Jakita Wagner,703)
Read static data using objectFile and map names
JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> userNameLookup = staticRdd
.mapToPair(user -> new Tuple2<>(user.getId(), user.getName()));
Analyze number of hit points per hitter over a sliding window
JavaPairDStream<String, Long> hitters = kafkaStream
.mapToPair(kafkaTuple -> new Tuple2<>(kafkaTuple._2().getHitterId(),
kafkaTuple._2().getHitPoints()))
.reduceByKeyAndWindow((accum, current) -> accum + current,
(accum, remove) -> accum - remove,
Durations.seconds(60),
Durations.seconds(10));
Join with static data to find username for each hitter
hitters.foreachRDD((rdd, time) -> {
JavaRDD<Tuple2<String, Long>> namedHitters = rdd
.leftOuterJoin(userNameLookup)
.map(joinedTuple -> {
String username = joinedTuple._2()._2().or("No name");
Long hitpoints = joinedTuple._2()._1();
return new Tuple2<>(username, hitpoints);
})
.sortBy(Tuple2::_2, false, PARTITIONS);
namedHitters.saveAsTextFile(outputPath + "/round3-" + time);
LOGGER.info("Five highest hitters (total: {}){}",
namedHitters.count(), namedHitters.take(5));
return null;
});
Output
15/11/25 20:34:23 Five highest hitters (total: 200)
[(Nick Fury,691), (Lady Blackhawk,585), (Choocho Colon,585), (Purple Man,539),
15/11/25 20:34:33 Five highest hitters (total: 378)
[(Captain Dorja,826), (Choocho Colon,773), (Nick Fury,691), (Kari Limbo,646),
15/11/25 20:34:43 Five highest hitters (total: 378)
[(Captain Dorja,1154), (Choocho Colon,867), (Wendy Go,723), (Kari Limbo,699),
15/11/25 20:34:53 Five highest hitters (total: 558)
[(Captain Dorja,1154), (Wendy Go,931), (Choocho Colon,867), (Fyodor Dostoyevsky,
Performance
Yahoo Streaming Benchmark [4], [5]
5th Round Points
Spark makes some things easier
But Flink is real streaming
In Spark you often have to specify the number of partitions explicitly (see the sketch below)
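A hypothetical snippet (partition counts and data are arbitrary) showing where partition numbers surface in the Spark Java API:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PartitionExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("partitions").setMaster("local[2]"));
        int partitions = 8; // arbitrary; in practice depends on cluster and data size

        JavaPairRDD<String, Long> hitpoints = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("Tengu", 546L), new Tuple2<>("Kalamity", 648L),
                new Tuple2<>("Tengu", 120L)));

        // reduceByKey, sortByKey, repartition & co. take an explicit partition count
        hitpoints.reduceByKey((a, b) -> a + b, partitions)
                 .repartition(partitions)
                 .collect()
                 .forEach(t -> System.out.println(t._1() + ": " + t._2()));

        sc.stop();
    }
}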
The Judges' Call
Development
Compared to Hadoop, both are awesome
Both provide unified programming model for diverse scenarios
Comfort level of abstraction varies with use-case
Spark's Java API is cumbersome compared to the Scala API
Working with both is fun
Docs are ok, but spotty
Testing
Testing distributed systems will always be hard
Functionally both can be tested nicely, e.g. with local-mode execution (sketch below)
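As an illustration (not from the talk), a local-mode JUnit test of the human-hero filter logic with Spark; the Flink equivalent looks much the same using a local execution environment:

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

public class HumanFilterTest {

    @Test
    public void keepsOnlyHumanSegments() {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("test").setMaster("local[2]"));
        try {
            List<String> result = sc.parallelize(Arrays.asList("Human", "Robot", "Human"))
                    .filter(segment -> segment.equals("Human"))
                    .collect();
            assertEquals(Arrays.asList("Human", "Human"), result);
        } finally {
            sc.stop();
        }
    }
}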
Monitoring
Community
The Judge's Call
It depends...
Use Spark, if
You have Cloudera, Hortonworks, etc. support and depend on it
You want to heavily use Graph and ML libraries
You want to use the more mature project
Use Flink, if
Real-Time processing is important for your use case
You want more complex window operations
You develop in Java only
You want to support a German project
Benchmark References
[1] http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
[2] http://eastcirclek.blogspot.de/2015/06/terasort-for-spark-and-flink-with-range.html
[3] http://eastcirclek.blogspot.de/2015/07/hash-join-on-tez-spark-and-flink.html
[4] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[5] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Thank You!
Questions?
michael.pisula@tng.tech    konstantin.knauf@tng.tech