Spark vs Flink
Rumble in the (Big Data) Jungle
München, 2016-04-20
Konstantin Knauf, Michael Pisula
Background
The Big Data Ecosystem
Apache Top-Level Projects over Time
(timeline chart spanning 2008-2015)
The New Guard
                     Spark                            Flink
Origin               Berkeley University              TU Berlin
Apache Incubator     2013                             04/2014
Apache Top-Level     02/2014                          01/2015
Company              databricks                       data Artisans
Supported languages  Scala, Java, Python, R           Java, Scala, Python
Implemented in       Scala                            Java
Cluster              Stand-Alone, Mesos, EC2, YARN    Stand-Alone, Mesos, EC2, YARN
Teaser               "Lightning-fast cluster          "Scalable Batch and Stream
                      computing"                       Data Processing"
The Challenge
Real-Time Analysis of a Superhero Fight Club
Fight (Stream of events)
  hitter: Int
  hittee: Int
  hitpoints: Int

Segment (Batch)
  id: Int
  name: String
  segment: String

Detail (Batch)
  name: String
  gender: Int
  birthYear: Int
  noOfAppearances: Int

Hero (Segment joined with Detail)
  id: Int
  name: String
  segment: String
  gender: Int
  birthYear: Int
  noOfAppearances: Int
The Setup
(Architecture diagram: a Data Generator provides the Segment and Detail data and publishes Avro-encoded Fight events to a Kafka cluster; on an AWS cluster, batch processing joins the static data into Heroes, which stream processing then combines with the Kafka stream.)
Round 1
Setting up
Dependencies
compile "org.apache.flink:flink-java:1.0.0"
compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"
//For Local Execution from IDE
compile "org.apache.flink:flink-clients_2.11:1.0.0"
Skeleton
//Batch (DataSet API)
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//Stream (DataStream API)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//Processing Logic
//For Streaming
env.execute();
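To see the skeleton in action, here is a minimal, hypothetical DataStream job (not from the talk, names are illustrative) that runs locally from the IDE with the dependencies above:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinimalStreamingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // tiny in-memory stream instead of Kafka, just to exercise the API
        env.fromElements(100, 250, 75)
           .filter(hitPoints -> hitPoints > 100)
           .print();
        // nothing happens until execute() is called
        env.execute("minimal-streaming-job");
    }
}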
Dependencies
compile 'org.apache.spark:spark-core_2.10:1.5.0'
compile 'org.apache.spark:spark-streaming_2.10:1.5.0'
Skeleton Batch
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
// Batch
JavaSparkContext sparkContext = new JavaSparkContext(conf);
// Stream
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(1));
// Processing Logic
jssc.start(); // For Streaming
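And a minimal, hypothetical Spark batch counterpart in local mode (again not from the talk):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MinimalBatchJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("minimal-batch-job").setMaster("local[2]");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        // tiny in-memory RDD instead of S3 input, just to exercise the API
        long humans = sparkContext.parallelize(Arrays.asList("Human", "Robot", "Human"))
                .filter(segment -> segment.equals("Human"))
                .count();
        System.out.println("Human records: " + humans);
        sparkContext.stop();
    }
}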
First Impressions
Practically no boilerplate
Easy to get started and play around
Runs in the IDE
Hadoop MapReduce is much harder to get into
Round 2
Static Data Analysis
Combine both static data parts
Read the csv file and transform it
JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
.map(line -> line.split(","))
.filter(array -> array.length == 3)
.mapToPair((String[] parts) -> {
int id = Integer.parseInt(parts[0]);
String name = parts[1], segment = parts[2];
return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
});
Join with detail data, keep only the human heroes and write the output
segmentTable.join(detailTable)
.mapValues(tuple -> {
SegmentTableRecord s = tuple._1();
DetailTableRecord d = tuple._2();
return new Hero(s.getId(), s.getName(), s.getSegment(),
d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
})
.map(tuple -> tuple._2())
.filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
.saveAsTextFile("s3://...");
Loading Files from S3 into POJO
DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
.ignoreInvalidLines()
.pojoType(SegmentTableRecord.class, "id", "name", "segment");
Join and Filter
DataSet<Hero> humanHeros = segmentTable.join(detailTable)
.where("name")
.equalTo("name")
.with((s, d) -> new Hero(s.id, s.name, s.segment,
d.gender, d.birthYear, d.noOfAppearances))
.filter(hero -> hero.segment.equals("Human"));
Write back to S3
humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE,
h -> h.toCsv());
Performance
Terasort1: Flink ca. 66% of runtime
Terasort2: Flink ca. 68% of runtime
HashJoin: Flink ca. 32% of runtime
(Iterative Processes: Flink ca. 50% of runtime, ca. 7% with
Delta-Iterations)
2nd Round Points
Generally similar abstraction and feature set
Flink has a nicer syntax, more sugar
Spark is pretty bare-bones
Flink is faster
Round 3
Simple Real Time Analysis
Total Hitpoints over Last Minute
Configuring Environment for EventTime
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
ExecutionConfig config = env.getConfig();
config.setAutoWatermarkInterval(500);
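The FightEventTimestampExtractor(6000) used in the processing logic below is not shown on the slides. A rough, hypothetical sketch, written against Flink's AssignerWithPeriodicWatermarks interface rather than the older assignTimestamps()/TimestampExtractor API used on the slides, and assuming FightEvent exposes a getTimestamp() accessor:

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class FightEventTimestampAssigner
        implements AssignerWithPeriodicWatermarks<FightEvent> {

    private final long maxOutOfOrdernessMillis;
    private long currentMaxTimestamp;

    public FightEventTimestampAssigner(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMillis;
    }

    @Override
    public long extractTimestamp(FightEvent event, long previousElementTimestamp) {
        long ts = event.getTimestamp(); // assumption: the event carries its event time
        currentMaxTimestamp = Math.max(currentMaxTimestamp, ts);
        return ts;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // the watermark trails the highest timestamp seen by the allowed lag
        return new Watermark(currentMaxTimestamp - maxOutOfOrdernessMillis);
    }
}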
Creating Stream from Kafka
Properties properties = new Properties();
properties.put("bootstrap.servers", KAFKA_BROKERS);
properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
properties.put("group.id", KAFKA_GROUP_ID);
DataStreamSource<FightEvent> hitStream =
env.addSource(new FlinkKafkaConsumer08<>("FightEventTopic",
new FightEventDeserializer(),
properties));
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
.timeWindowAll(Time.of(60, TimeUnit.SECONDS),
Time.of(10, TimeUnit.SECONDS))
.apply(new SumAllWindowFunction<FightEvent>() {
@Override
public long getSummand(FightEvent fightEvent) {
return fightEvent.getHitPoints();
}
})
.writeAsCsv("s3://...");
Example Output
3> (1448130670000,1448130730000,290789)
4> (1448130680000,1448130740000,289395)
5> (1448130690000,1448130750000,291768)
6> (1448130700000,1448130760000,292634)
7> (1448130710000,1448130770000,293869)
8> (1448130720000,1448130780000,293356)
1> (1448130730000,1448130790000,293054)
2> (1448130740000,1448130800000,294209)
Create Context and get Avro Stream from Kafka
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(1));
HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "xxx:11211");
kafkaParams.put("group.id", "spark");
JavaPairInputDStream<String, FightEvent> kafkaStream =
KafkaUtils.createDirectStream(jssc, String.class, FightEvent.class,
StringDecoder.class, AvroDecoder.class, kafkaParams, topicsSet);
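The AvroDecoder handed to createDirectStream is custom code that is not shown on the slides. A hypothetical sketch, assuming FightEvent is an Avro-generated specific record and using the Kafka 0.8 Decoder interface that this KafkaUtils variant expects:

import java.io.IOException;

import kafka.serializer.Decoder;
import kafka.utils.VerifiableProperties;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public class AvroDecoder implements Decoder<FightEvent> {

    private final SpecificDatumReader<FightEvent> reader =
            new SpecificDatumReader<>(FightEvent.class);

    // Kafka instantiates decoders reflectively via this constructor
    public AvroDecoder(VerifiableProperties properties) {
    }

    @Override
    public FightEvent fromBytes(byte[] bytes) {
        try {
            return reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
        } catch (IOException e) {
            throw new RuntimeException("Could not deserialize FightEvent", e);
        }
    }
}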
Analyze number of hit points over a sliding window
kafkaStream.map(tuple -> tuple._2().getHitPoints())
.reduceByWindow((hit1, hit2) -> hit1 + hit2,
Durations.seconds(60), Durations.seconds(10))
.foreachRDD((rdd, time) -> {
rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
return null;
});
Output
20:19:32 Hitpoints in the last minute [80802]
20:19:42 Hitpoints in the last minute [101019]
20:19:52 Hitpoints in the last minute [141012]
20:20:02 Hitpoints in the last minute [184759]
20:20:12 Hitpoints in the last minute [215802]
3rd Round Points
Flink supports event time windows
Kafka and Avro worked seamlessly in both
Spark uses micro-batches, no real stream
Both have at-least-once delivery guarantees
Exactly-once depends a lot on sink/source
Round 4
Connecting Static Data with Real
Time Data
Total Hitpoints over Last Minute Per Gender
Read static data using objectFile and map genders
JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> genderLookup = staticRdd.mapToPair(user -> {
int genderIndicator = user.getGender();
String gender;
switch (genderIndicator) {
case 1: gender = "MALE"; break;
case 2: gender = "FEMALE"; break;
default: gender = "OTHER"; break;
}
return new Tuple2<>(user.getId(), gender);
});
Analyze number of hit points per hitter over a sliding window
JavaPairDStream<String, Long> hitpointWindowedStream = kafkaStream
.mapToPair(tuple -> {
FightEvent fight = tuple._2();
return new Tuple2<>(fight.getHitterId(), fight.getHitPoints());
})
.reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2,
Durations.seconds(60),
Durations.seconds(10));
Join with static data to find gender for each hitter
hitpointWindowedStream.foreachRDD((rdd, time) -> {
JavaPairRDD<String, Long> hpg = rdd.leftOuterJoin(genderLookup)
.mapToPair(joinedTuple -> {
Optional<String> maybeGender = joinedTuple._2()._2();
Long hitpoints = joinedTuple._2()._1();
return new Tuple2<>(maybeGender.or("UNKNOWN"), hitpoints);
})
.reduceByKey((hit1, hit2) -> hit1 + hit2);
hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
LOGGER.info("Hitpoints per gender {}", hpg.take(5));
return null;
});
Output
20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
20:31:14 Hitpoints [(FEMALE,65543), (OTHER,813), (MALE,116416)]
20:31:24 Hitpoints [(FEMALE,67507), (OTHER,813), (MALE,123750)]
Loading Static Data in Every Map
public FightEventEnricher(String bucket, String keyPrefix) {
this.bucket = bucket;
this.keyPrefix = keyPrefix;
}
@Override
public void open(Configuration parameters) {
populateHeroMapFromS3(bucket, keyPrefix);
}
@Override
public EnrichedFightEvent map(FightEvent event) throws Exception {
return new EnrichedFightEvent(event,
idToHero.get(event.getHitterId()),
idToHero.get(event.getHitteeId()));
}
private void populateHeroMapFromS3(String bucket, String keyPrefix) {
// Omitted
}
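For context, the methods above presumably belong to a Flink RichMapFunction, so that open() loads the hero lookup once per task before map() is ever called. A hypothetical sketch of the complete class (integer hero ids and the project's FightEvent/Hero/EnrichedFightEvent types assumed):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class FightEventEnricher extends RichMapFunction<FightEvent, EnrichedFightEvent> {

    private final String bucket;
    private final String keyPrefix;
    private final Map<Integer, Hero> idToHero = new HashMap<>(); // assuming integer hero ids

    public FightEventEnricher(String bucket, String keyPrefix) {
        this.bucket = bucket;
        this.keyPrefix = keyPrefix;
    }

    @Override
    public void open(Configuration parameters) {
        // runs once per parallel task before any map() call
        populateHeroMapFromS3(bucket, keyPrefix);
    }

    @Override
    public EnrichedFightEvent map(FightEvent event) throws Exception {
        // pure in-memory lookup on the hot path
        return new EnrichedFightEvent(event,
                idToHero.get(event.getHitterId()),
                idToHero.get(event.getHitteeId()));
    }

    private void populateHeroMapFromS3(String bucket, String keyPrefix) {
        // omitted on the slide: read the Heroes output from S3 and fill idToHero
    }
}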
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
.map(new FightEventEnricher("s3_bucket", "output/heros"))
.filter(value -> value.getHittingHero() != null)
.keyBy(enrichedFightEvent ->
enrichedFightEvent.getHittingHero().getGender())
.timeWindow(Time.of(60, TimeUnit.SECONDS),
Time.of(10, TimeUnit.SECONDS))
.apply(new SumWindowFunction<EnrichedFightEvent, Integer>() {
@Override
public long getSummand(EnrichedFightEvent value) {
return value.getFightEvent()
.getHitPoints();
}
})
Example Output
2> (1448191350000,1448191410000,1,28478)
3> (1448191350000,1448191410000,2,264650)
2> (1448191360000,1448191420000,1,28290)
3> (1448191360000,1448191420000,2,263521)
2> (1448191370000,1448191430000,1,29327)
3> (1448191370000,1448191430000,2,265526)
4th Round Points
Spark makes combining batch and streaming easier
Windowing by key works well in both
Java API of Spark can be annoying
Round 5
More Advanced Real Time
Analysis
Best Hitter over Last Minute Per Gender
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
.map(new FightEventEnricher("s3_bucket", "output/heros"))
.filter(value -> value.getHittingHero() != null)
.keyBy(fightEvent -> fightEvent.getHittingHero().getName())
.timeWindow(Time.of(60, TimeUnit.SECONDS),
Time.of(10, TimeUnit.SECONDS))
.apply(new SumWindowFunction<EnrichedFightEvent, String>() {
@Override
public long getSummand(EnrichedFightEvent value) {
return value.getFightEvent().getHitPoints();
}
})
.assignTimestamps(new AscendingTimestampExtractor<...>() {
@Override
public long extractAscendingTimestamp(Tuple4<...> tuple, long l) {
return tuple.f0;
}
})
.timeWindowAll(Time.of(10, TimeUnit.SECONDS))
.maxBy(3)
.print();
Example Output
1> (1448200070000,1448200130000,Tengu,546)
2> (1448200080000,1448200140000,Louis XIV,621)
3> (1448200090000,1448200150000,Louis XIV,561)
4> (1448200100000,1448200160000,Louis XIV,552)
5> (1448200110000,1448200170000,Phil Dexter,620)
6> (1448200120000,1448200180000,Phil Dexter,552)
7> (1448200130000,1448200190000,Kalamity,648)
8> (1448200140000,1448200200000,Jakita Wagner,656)
1> (1448200150000,1448200210000,Jakita Wagner,703)
Read static data using objectFile and map names
JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> userNameLookup = staticRdd
.mapToPair(user -> new Tuple2<>(user.getId(), user.getName()));
Analyze number of hit points per hitter over a sliding window
JavaPairDStream<String, Long> hitters = kafkaStream
.mapToPair(kafkaTuple -> new Tuple2<>(kafkaTuple._2().getHitterId(),
kafkaTuple._2().getHitPoints()))
.reduceByKeyAndWindow((accum, current) -> accum + current,
(accum, remove) -> accum - remove,
Durations.seconds(60),
Durations.seconds(10));
Join with static data to find username for each hitter
hitters.foreachRDD((rdd, time) -> {
JavaRDD<Tuple2<String, Long>> namedHitters = rdd
.leftOuterJoin(userNameLookup)
.map(joinedTuple -> {
String username = joinedTuple._2()._2().or("No name");
Long hitpoints = joinedTuple._2()._1();
return new Tuple2<>(username, hitpoints);
})
.sortBy(Tuple2::_2, false, PARTITIONS);
namedHitters.saveAsTextFile(outputPath + "/round3-" + time);
LOGGER.info("Five highest hitters (total: {}){}",
namedHitters.count(), namedHitters.take(5));
return null;
});
Output
15/11/25 20:34:23 Five highest hitters (total: 200)
[(Nick Fury,691), (Lady Blackhawk,585), (Choocho Colon,585), (Purple Man,539),
15/11/25 20:34:33 Five highest hitters (total: 378)
[(Captain Dorja,826), (Choocho Colon,773), (Nick Fury,691), (Kari Limbo,646),
15/11/25 20:34:43 Five highest hitters (total: 378)
[(Captain Dorja,1154), (Choocho Colon,867), (Wendy Go,723), (Kari Limbo,699),
15/11/25 20:34:53 Five highest hitters (total: 558)
[(Captain Dorja,1154), (Wendy Go,931), (Choocho Colon,867), (Fyodor Dostoyevsky,
Performance
Yahoo Streaming Benchmark [4], [5]
5th Round Points
Spark makes some things easier
But Flink is real streaming
In Spark you often have to specify the number of partitions explicitly (see the sketch below)
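A hypothetical snippet (partition counts and data are arbitrary) showing where partition numbers surface in the Spark Java API:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PartitionExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("partitions").setMaster("local[2]"));
        int partitions = 8; // arbitrary; in practice depends on cluster and data size

        JavaPairRDD<String, Long> hitpoints = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("Tengu", 546L), new Tuple2<>("Kalamity", 648L),
                new Tuple2<>("Tengu", 120L)));

        // reduceByKey, sortByKey, repartition & co. take an explicit partition count
        hitpoints.reduceByKey((a, b) -> a + b, partitions)
                 .repartition(partitions)
                 .collect()
                 .forEach(t -> System.out.println(t._1() + ": " + t._2()));

        sc.stop();
    }
}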
The Judges' Call
Development
Compared to Hadoop, both are awesome
Both provide unified programming model for diverse scenarios
Comfort level of abstraction varies with use-case
Spark's Java API is cumbersome compared to the Scala API
Working with both is fun
Docs are ok, but spotty
Testing
Testing distributed systems will always be hard
Functionally both can be tested nicely, e.g. with local-mode execution (sketch below)
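As an illustration (not from the talk), a local-mode JUnit test of the human-hero filter logic with Spark; the Flink equivalent looks much the same using a local execution environment:

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

public class HumanFilterTest {

    @Test
    public void keepsOnlyHumanSegments() {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("test").setMaster("local[2]"));
        try {
            List<String> result = sc.parallelize(Arrays.asList("Human", "Robot", "Human"))
                    .filter(segment -> segment.equals("Human"))
                    .collect();
            assertEquals(Arrays.asList("Human", "Human"), result);
        } finally {
            sc.stop();
        }
    }
}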
Monitoring
Community
The Judge's Call
It depends...
Use Spark, if
You have Cloudera, Hortonworks, etc. support and depend on it
You want to heavily use Graph and ML libraries
You want to use the more mature project
Use Flink, if
Real-Time processing is important for your use case
You want more complex window operations
You develop in Java only
You want to support a German project
Benchmark References
[1] http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
[2] http://eastcirclek.blogspot.de/2015/06/terasort-for-spark-and-flink-with-range.html
[3] http://eastcirclek.blogspot.de/2015/07/hash-join-on-tez-spark-and-flink.html
[4] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[5] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Thank You!
Questions?
michael.pisula@tng.tech    konstantin.knauf@tng.tech