Kafka, Spark and Cassandra
instaclustr.com
Who am I and what do I do?
• Ben Bromhead
• Co-founder and CTO of Instaclustr -> www.instaclustr.com
• Instaclustr provides Cassandra-as-a-Service in the cloud.
• Currently running in IBM Softlayer, AWS and Azure
• 500+ nodes under management
What will this talk cover?
• An introduction to Cassandra
• An introduction to Spark
• An introduction to Kafka
• Building a data pipeline with Cassandra, Spark & Kafka
What happens when you have more data than can fit on a
single server?
Throw money at the problem
Introducing Cassandra
• BigTable (2006) - 1 Key: Lots of values, Fast sequential access
• Dynamo (2007) - Reliable, Performant, Always On
• Cassandra (2008) - Dynamo Architecture, BigTable data model and
storage
One database, many servers
• All servers (nodes) participate in
the cluster
• Shared nothing
• Need more capacity? Add more servers
• Multiple servers == built-in redundancy
How does it work?
(ring diagram: nodes at tokens 0, 4, 28)
Partitioning
Name Age Postcode Gender
Alice 34 2000 F
Bob 26 2000 M
Eve 25 2004 F
Frank 41 2902 M
How does it work?
client
consistentHash("Alice")
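The consistentHash step can be sketched in plain Java. This is a simplified illustration only: Cassandra actually uses the Murmur3 partitioner over a much larger token space, and the TokenRing class, node names and tokenSpace parameter below are all hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class TokenRing {
    // Node name -> token, mirroring the 0 / 4 / 28 ring in the slides.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String name, int token) {
        ring.put(token, name);
    }

    // Simplified stand-in for Cassandra's partitioner: hash the partition
    // key to a token, then pick the first node at or after that token,
    // wrapping around the ring if necessary.
    public String nodeFor(String partitionKey, int tokenSpace) {
        int token = Math.floorMod(partitionKey.hashCode(), tokenSpace);
        SortedMap<Integer, String> tail = ring.tailMap(token);
        Integer owner = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(owner);
    }
}
```

The key property: the same partition key always hashes to the same token, so every client deterministically finds the same owning node with no central lookup.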
A brief intro to tuneable consistency
• Cassandra is considered to be a database that favours Availability and
Partition Tolerance.
• Lets you change those characteristics per query to suit your
application requirements.
• Define your replication factor on the schema level
• Define your consistency level at query time
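A common way to reason about these two knobs together: with replication factor RF, a read consistency level R and a write consistency level W, a read is guaranteed to overlap the latest write whenever R + W > RF. A small illustrative helper (my own sketch, not part of any Cassandra driver API):

```java
public class ConsistencyMath {
    // Replicas that must acknowledge for CL.QUORUM at a given
    // replication factor: floor(RF / 2) + 1, i.e. "50% + 1".
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Strong consistency holds when the read and write replica sets
    // are forced to overlap in at least one node.
    public static boolean stronglyConsistent(int readCl, int writeCl, int rf) {
        return readCl + writeCl > rf;
    }
}
```

For example, with RF = 3, QUORUM reads plus QUORUM writes (2 + 2 > 3) give strong consistency, while ONE + ONE (1 + 1 = 2) does not.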
How does it work?
client
consistentHash("Alice")
Replication Factor = 3
What are the benefits to this approach
• Linear scalability
• High Availability
• Use commodity hardware
Linear scalability
                        48 Nodes    96 Nodes    144 Nodes   288 Nodes
Writes per second
per node                10,900      11,460      11,900      11,456
Mean Latency            0.0117ms    0.0134ms    0.0148ms    0.0139ms
Cluster Writes
per second              174,373     366,828     537,172     1,099,837
Linear scalability
High Availability
“During Hurricane Sandy, we lost an entire
data center. Completely. Lost. It. Our
application fail-over resulted in us losing just a
few moments of serving requests for a particular
region of the country, but our data in
Cassandra never went offline.”
Nathan Milford, Outbrain’s head of U.S. IT operations management
Commodity Hardware
How do we keep data consistent?
client
consistentHash("Alice")
CL.QUORUM (50% + 1): the write succeeds once 2 of the 3 replicas Ack,
even if one replica (the X in the diagram) is down.
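The quorum write above can be simulated in a few lines: with RF = 3, CL.QUORUM needs RF/2 + 1 = 2 acknowledgements, so the write still succeeds with one replica down. The QuorumWrite class below is a hypothetical sketch, not Cassandra code:

```java
import java.util.List;

public class QuorumWrite {
    // Simulate a write to RF replicas, some of which may be down.
    // The write succeeds if acknowledgements reach CL.QUORUM,
    // i.e. RF/2 + 1 (the "50% + 1" on the slide).
    public static boolean write(List<Boolean> replicaUp) {
        int quorum = replicaUp.size() / 2 + 1;
        long acks = replicaUp.stream().filter(up -> up).count();
        return acks >= quorum;
    }
}
```

With three replicas, one failure is tolerated at QUORUM; two failures are not.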
Add capacity
client
consistentHash("Alice")
Analytics & Cassandra
• What about ad-hoc queries?
• What was the minimum, maximum and average latency for a given
client
• Give me all devices that had a temperature > 40 for longer than 20
minutes
• Top 10 locations where vehicles recorded speeds > 60
Introducing Spark
• A distributed computing engine.
• For very large datasets.
• Essentially a way to run any query or algorithm over a very large set of
data.
• Works with the existing Hadoop ecosystem.
Spark
• Faster and better than Hadoop
• In-memory (100x faster)
• Intelligent caching on disk (10x faster)
• Fault tolerant, immutable datasets and intermediate steps (DRY)
• Sane API and integrations (never write map/reduce jobs again)
Spark vs Hadoop
Initial input Intermediate step Final output
Initial input Intermediate step Final output
How fast is Spark?
                    Hadoop MR Record    Spark Record        Spark 1PB
Data Size           102.5 TB            100 TB              1000 TB
Elapsed Time        72 mins             23 mins             234 mins
# Nodes             2100                206                 190
# Cores             50400 physical      6592 virtualized    6080 virtualized
Cluster disk
throughput          3150 GB/s           618 GB/s            570 GB/s
Daytona Rules       Yes                 Yes                 No
Network             dedicated data      virtualized (EC2)   virtualized (EC2)
                    center, 10Gbps      10Gbps network      10Gbps network
Sort rate           1.42 TB/min         4.27 TB/min         4.27 TB/min
Sort rate/node      0.67 GB/min         20.7 GB/min         22.5 GB/min
A quick use case
Square Kilometre Array (SKA)
700 TB/second of raw data
Spark is part of the data processing pipeline
See http://www.slideshare.net/SparkSummit/spark-at-nasajplchris-mattmann
Spark is also easy
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark word count
spark.cassandraTable("Keyspace", "Table")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://...")
Spark
Hadoop word count
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Spark
public class WordCount {
public static void main(String[] args) {
JavaRDD<String> textFile = spark.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
}
}
Spark word count
Spark
What happens under the hood?
val textFile = spark.textFile("s3a://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.collect()
textFile → flatMap → map → reduceByKey → collect
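The same pipeline can be mimicked locally with Java streams to show what each stage computes. This is an analogy only: Spark builds the chain lazily as a DAG and distributes each stage across executors, while the sketch below runs in a single JVM.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountLocal {
    // Local, single-JVM analogue of
    // textFile -> flatMap -> map -> reduceByKey -> collect.
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // flatMap
                .collect(Collectors.toMap(
                        Function.identity(),  // map: (word, 1)
                        w -> 1,
                        Integer::sum));       // reduceByKey: _ + _
    }
}
```

In Spark, nothing executes until the action (collect or saveAsTextFile) is reached; the stream version evaluates eagerly but the per-stage data flow is the same.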
Spark
Spark + Cassandra
• Analytics with a super fast operational db?
• Use https://github.com/datastax/spark-cassandra-connector
Spark + Cassandra
(diagram: each server runs a Spark Worker whose Executors are colocated
with the Cassandra server)
Spark + Cassandra
(diagram: a Spark Master coordinating Workers colocated with the
Cassandra ring)
Spark + Cassandra
• We now have a great platform to work with data already in Cassandra
• But what if we have a stream of data (e.g. many devices sending data
to us all the time)?
• We want answers now and as events happen
Spark + Cassandra: batch vs Streaming
Spark + Cassandra
Spark + Cassandra
• After you have performed calculations on the DStream object provided
by Spark, persist to Cassandra
• Simply call .saveToCassandra("keyspace", "table")
Spark + Cassandra
• What about ingest?
• Enter Kafka
What is Kafka
• Unified platform for handling message feeds (aka message bus)
• High Volume
• Derived Feeds
• Support large feeds from offline ingest
• Low latency messaging
• Fault tolerance during machine failure
What is Kafka
• Publish / Subscribe message architecture
• Consumers “receive” messages
• Publishers “send” messages
• Message routing determined based on “Topic”
• Topics are split into partitions which are replicated
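Keyed routing to a partition can be sketched as hash(key) mod numPartitions. This is a simplification: Kafka's real default partitioner hashes the serialized key with murmur2 (and spreads keyless messages across partitions), but the idea is the same, and it is what keeps all messages for one key in order on one partition.

```java
public class TopicPartitioner {
    // Route a keyed message to one of numPartitions partitions.
    // Messages with the same key always land on the same partition,
    // which preserves per-key ordering within a topic.
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```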
What is Kafka
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
A quick case study
• Kafka at LinkedIn
• 15 brokers
• 15,500 partitions (replication factor 2)
• 400,000 msg/s
• Event processing
Spark + Cassandra + Kafka
• Why Kafka?
• Pluggable receivers for MQTT and HTTP ingest
• Spark Streaming can consume directly from a Kafka Queue and write
to Cassandra
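The whole ingest path can be sketched in miniature with stand-ins: a queue for the Kafka topic, fixed-size batches for Spark Streaming micro-batches, and a map for the Cassandra table. Everything here (class and method names included) is hypothetical and runs without any cluster.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class MiniPipeline {
    // "Kafka topic": a queue of raw events, e.g. "deviceId,temperature".
    private final Queue<String> topic = new ArrayDeque<>();
    // "Cassandra table": max temperature seen per device.
    private final Map<String, Integer> table = new HashMap<>();

    public void publish(String event) {
        topic.add(event);
    }

    // One "micro-batch": drain up to batchSize events, aggregate,
    // then persist the result (Spark Streaming works in the same
    // consume -> transform -> save rhythm, just distributed).
    public void processBatch(int batchSize) {
        for (int i = 0; i < batchSize && !topic.isEmpty(); i++) {
            String[] parts = topic.poll().split(",");
            String device = parts[0];
            int temp = Integer.parseInt(parts[1]);
            table.merge(device, temp, Math::max);
        }
    }

    public Integer maxTempFor(String device) {
        return table.get(device);
    }
}
```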
Spark + Cassandra + Kafka
Putting it all together:
(diagram: an MQTT bridge feeds the Kafka cluster; Spark Workers,
coordinated by a Master, consume it and write to the Cassandra ring)
Spark + Cassandra + Kafka
Putting it all together:
(diagram: an MQTT bridge feeds the Kafka cluster; Spark Workers,
coordinated by a Master, consume it and write to the Cassandra ring)
Lambda Architecture!
• A highly distributed, resilient and highly available ingest, compute and
storage engine.
• Leverage additional Spark libraries to add capabilities to your project.
• Bonus: SparkML, bringing machine learning and artificial intelligence to
your data pipeline.
Spark + Cassandra + Kafka
Questions
Kafka, Spark and Cassandra webinar (Feb 16, 2016)