Kafka, Spark and Cassandra
instaclustr.com
Who am I and what do I do?
• Ben Bromhead
• Co-founder and CTO of Instaclustr -> www.instaclustr.com
• Instaclustr provides Cassandra-as-a-Service in the cloud.
• Currently running in IBM Softlayer, AWS and Azure
• 500+ nodes under management
What will this talk cover?
• An introduction to Cassandra
• An introduction to Spark
• An introduction to Kafka
• Building a data pipeline with Cassandra, Spark & Kafka
What happens when you have more data than can fit on a
single server?
Throw money at the problem
Introducing Cassandra
• BigTable (2006) - 1 Key: Lots of values, Fast sequential access
• Dynamo (2007) - Reliable, Performant, Always On
• Cassandra (2008) - Dynamo Architecture, BigTable data model and
storage
One database, many servers
• All servers (nodes) participate in
the cluster
• Shared nothing
• Need more capacity? Add more servers
• Multiple servers == built-in redundancy
How does it work?
(ring diagram: nodes at tokens 0, 4, 28)
Partitioning
Name Age Postcode Gender
Alice 34 2000 F
Bob 26 2000 M
Eve 25 2004 F
Frank 41 2902 M
How does it work?
client
consistentHash("Alice")
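The consistentHash step can be sketched in plain Java. This is a simplified illustration only: Cassandra actually uses the Murmur3 partitioner over a much larger token space, and the TokenRing class, node names and tokenSpace parameter below are all hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class TokenRing {
    // Node name -> token, mirroring the 0 / 4 / 28 ring in the slides.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String name, int token) {
        ring.put(token, name);
    }

    // Simplified stand-in for Cassandra's partitioner: hash the partition
    // key to a token, then pick the first node at or after that token,
    // wrapping around the ring if necessary.
    public String nodeFor(String partitionKey, int tokenSpace) {
        int token = Math.floorMod(partitionKey.hashCode(), tokenSpace);
        SortedMap<Integer, String> tail = ring.tailMap(token);
        Integer owner = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(owner);
    }
}
```

The key property: the same partition key always hashes to the same token, so every client deterministically finds the same owning node with no central lookup.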
A brief intro to tuneable consistency
• Cassandra is considered to be a database that favours Availability and
Partition Tolerance.
• Lets you change those characteristics per query to suit your
application requirements.
• Define your replication factor on the schema level
• Define your consistency level at query time
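A common way to reason about these two knobs together: with replication factor RF, a read consistency level R and a write consistency level W, a read is guaranteed to overlap the latest write whenever R + W > RF. A small illustrative helper (my own sketch, not part of any Cassandra driver API):

```java
public class ConsistencyMath {
    // Replicas that must acknowledge for CL.QUORUM at a given
    // replication factor: floor(RF / 2) + 1, i.e. "50% + 1".
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Strong consistency holds when the read and write replica sets
    // are forced to overlap in at least one node.
    public static boolean stronglyConsistent(int readCl, int writeCl, int rf) {
        return readCl + writeCl > rf;
    }
}
```

For example, with RF = 3, QUORUM reads plus QUORUM writes (2 + 2 > 3) give strong consistency, while ONE + ONE (1 + 1 = 2) does not.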
How does it work?
client
consistentHash("Alice")
Replication Factor = 3
What are the benefits to this approach
• Linear scalability
• High Availability
• Use commodity hardware
Linear scalability
                        48 Nodes    96 Nodes    144 Nodes   288 Nodes
Writes per second
per node                10,900      11,460      11,900      11,456
Mean Latency            0.0117ms    0.0134ms    0.0148ms    0.0139ms
Cluster Writes
per second              174,373     366,828     537,172     1,099,837
Linear scalability
High Availability
“During Hurricane Sandy, we lost an entire
data center. Completely. Lost. It. Our
application fail-over resulted in us losing just a
few moments of serving requests for a particular
region of the country, but our data in
Cassandra never went offline.”
Nathan Milford, Outbrain’s head of U.S. IT operations management
Commodity Hardware
How do we keep data consistent?
client
consistentHash("Alice")
CL.QUORUM (50% + 1): the write succeeds once 2 of the 3 replicas Ack,
even if one replica (the X in the diagram) is down.
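The quorum write above can be simulated in a few lines: with RF = 3, CL.QUORUM needs RF/2 + 1 = 2 acknowledgements, so the write still succeeds with one replica down. The QuorumWrite class below is a hypothetical sketch, not Cassandra code:

```java
import java.util.List;

public class QuorumWrite {
    // Simulate a write to RF replicas, some of which may be down.
    // The write succeeds if acknowledgements reach CL.QUORUM,
    // i.e. RF/2 + 1 (the "50% + 1" on the slide).
    public static boolean write(List<Boolean> replicaUp) {
        int quorum = replicaUp.size() / 2 + 1;
        long acks = replicaUp.stream().filter(up -> up).count();
        return acks >= quorum;
    }
}
```

With three replicas, one failure is tolerated at QUORUM; two failures are not.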
Add capacity
client
consistentHash("Alice")
Analytics & Cassandra
• What about ad-hoc queries?
• What was the minimum, maximum and average latency for a given
client
• Give me all devices that had a temperature > 40 for longer than 20
minutes
• Top 10 locations where vehicles recorded speeds > 60
Introducing Spark
• A distributed computing engine.
• For very large datasets.
• Essentially a way to run any query or algorithm over a very large set of
data.
• Works with the existing Hadoop ecosystem.
Spark
• Faster and better than Hadoop
• In-memory (100x faster)
• Intelligent caching on disk (10x faster)
• Fault tolerant, immutable datasets and intermediate steps (DRY)
• Sane API and integrations (never write map/reduce jobs again)
Spark vs Hadoop
Initial input Intermediate step Final output
Initial input Intermediate step Final output
How fast is Spark?
                    Hadoop MR Record    Spark Record        Spark 1PB
Data Size           102.5 TB            100 TB              1000 TB
Elapsed Time        72 mins             23 mins             234 mins
# Nodes             2100                206                 190
# Cores             50400 physical      6592 virtualized    6080 virtualized
Cluster disk
throughput          3150 GB/s           618 GB/s            570 GB/s
Daytona Rules       Yes                 Yes                 No
Network             dedicated data      virtualized (EC2)   virtualized (EC2)
                    center, 10Gbps      10Gbps network      10Gbps network
Sort rate           1.42 TB/min         4.27 TB/min         4.27 TB/min
Sort rate/node      0.67 GB/min         20.7 GB/min         22.5 GB/min
A quick use case
Square Kilometre Array (SKA)
700 TB/second of raw data
Spark is part of the data processing pipeline
See http://www.slideshare.net/SparkSummit/spark-at-nasajplchris-mattmann
Spark is also easy
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark word count
spark.cassandraTable("Keyspace", "Table")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://...")
Spark
Hadoop word count
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Spark
public class WordCount {
public static void main(String[] args) {
JavaRDD<String> textFile = spark.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
}
}
Spark word count
Spark
What happens under the hood?
val textFile = spark.textFile("s3a://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.collect()
textFile → flatMap → map → reduceByKey → collect
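The same pipeline can be mimicked locally with Java streams to show what each stage computes. This is an analogy only: Spark builds the chain lazily as a DAG and distributes each stage across executors, while the sketch below runs in a single JVM.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountLocal {
    // Local, single-JVM analogue of
    // textFile -> flatMap -> map -> reduceByKey -> collect.
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // flatMap
                .collect(Collectors.toMap(
                        Function.identity(),  // map: (word, 1)
                        w -> 1,
                        Integer::sum));       // reduceByKey: _ + _
    }
}
```

In Spark, nothing executes until the action (collect or saveAsTextFile) is reached; the stream version evaluates eagerly but the per-stage data flow is the same.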
Spark
Spark + Cassandra
• Analytics with a super fast operational db?
• Use https://github.com/datastax/spark-cassandra-connector
Spark + Cassandra
(diagram: each server runs a Spark Worker whose Executors are colocated
with the Cassandra server)
Spark + Cassandra
(diagram: a Spark Master coordinating Workers colocated with the
Cassandra ring)
Spark + Cassandra
• We now have a great platform to work with data already in Cassandra
• But what if we have a stream of data (e.g. many devices sending data
to us all the time)?
• We want answers now and as events happen
Spark + Cassandra: batch vs Streaming
Spark + Cassandra
Spark + Cassandra
• After you have performed calculations on the DStream object provided
by Spark, persist to Cassandra
• Simply call .saveToCassandra("keyspace", "table")
Spark + Cassandra
• What about ingest?
• Enter Kafka
What is Kafka
• Unified platform for handling message feeds (aka message bus)
• High Volume
• Derived Feeds
• Support large feeds from offline ingest
• Low latency messaging
• Fault tolerance during machine failure
What is Kafka
• Publish / Subscribe message architecture
• Consumers “receive” messages
• Publishers “send” messages
• Message routing determined based on “Topic”
• Topics are split into partitions which are replicated
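Keyed routing to a partition can be sketched as hash(key) mod numPartitions. This is a simplification: Kafka's real default partitioner hashes the serialized key with murmur2 (and spreads keyless messages across partitions), but the idea is the same, and it is what keeps all messages for one key in order on one partition.

```java
public class TopicPartitioner {
    // Route a keyed message to one of numPartitions partitions.
    // Messages with the same key always land on the same partition,
    // which preserves per-key ordering within a topic.
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```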
What is Kafka
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
A quick case study
• Kafka at LinkedIn
• 15 brokers
• 15,500 partitions (replication factor 2)
• 400,000 msg/s
• Event processing
Spark + Cassandra + Kafka
• Why Kafka?
• Pluggable receivers for MQTT and HTTP ingest
• Spark Streaming can consume directly from a Kafka Queue and write
to Cassandra
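The whole ingest path can be sketched in miniature with stand-ins: a queue for the Kafka topic, fixed-size batches for Spark Streaming micro-batches, and a map for the Cassandra table. Everything here (class and method names included) is hypothetical and runs without any cluster.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class MiniPipeline {
    // "Kafka topic": a queue of raw events, e.g. "deviceId,temperature".
    private final Queue<String> topic = new ArrayDeque<>();
    // "Cassandra table": max temperature seen per device.
    private final Map<String, Integer> table = new HashMap<>();

    public void publish(String event) {
        topic.add(event);
    }

    // One "micro-batch": drain up to batchSize events, aggregate,
    // then persist the result (Spark Streaming works in the same
    // consume -> transform -> save rhythm, just distributed).
    public void processBatch(int batchSize) {
        for (int i = 0; i < batchSize && !topic.isEmpty(); i++) {
            String[] parts = topic.poll().split(",");
            String device = parts[0];
            int temp = Integer.parseInt(parts[1]);
            table.merge(device, temp, Math::max);
        }
    }

    public Integer maxTempFor(String device) {
        return table.get(device);
    }
}
```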
Spark + Cassandra + Kafka
Putting it all together:
(diagram: an MQTT bridge feeds the Kafka cluster; Spark Workers,
coordinated by a Master, consume it and write to the Cassandra ring)
Spark + Cassandra + Kafka
Putting it all together:
(diagram: an MQTT bridge feeds the Kafka cluster; Spark Workers,
coordinated by a Master, consume it and write to the Cassandra ring)
Lambda Architecture!
• A highly distributed, resilient and highly available ingest, compute and
storage engine.
• Leverage additional Spark libraries to add capabilities to your project.
• Bonus: SparkML, bringing machine learning and artificial intelligence to
your data pipeline.
Spark + Cassandra + Kafka
Questions
Kafka, Spark and Cassandra webinar (Feb 16, 2016)