Cassandra And Spark Dataframes
Russell Spitzer
Software Engineer @ DataStax
Tungsten Gives Dataframes Off-Heap Power!
Data can be compared off-heap and bitwise, without deserialization!
Code generation!
The Core is the Cassandra Source
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra

/**
 * Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 * It inserts data to and scans a Cassandra table. If filterPushdown is true, it pushes
 * some filters down to CQL.
 */

DataFrame source: org.apache.spark.sql.cassandra
Under the hood, the DataFrame is backed by a CassandraSourceRelation, which drives a CassandraTableScanRDD and its Configuration.
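As a minimal sketch of using the source from both directions (assuming the Spark 1.x SQLContext API used throughout this deck; the words_copy table is hypothetical):

import org.apache.spark.sql.SaveMode

// Read: resolved through CassandraSourceRelation / PrunedFilteredScan
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()

// Write: resolved through InsertableRelation
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words_copy", "keyspace" -> "test"))
  .mode(SaveMode.Append)
  .save()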
Configuration Can Be Done on a Per-Source Level
Properties follow the pattern clusterName:keyspaceName/propertyName (the keyspace part is optional).

Example: Changing Cluster/Keyspace Level Properties

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"
  ))
  .load()

How the split size resolves for each source:
cluster "ClusterOne", keyspace "test": 32 mb (cluster-level setting)
cluster "default", keyspace "test": 128 mb (cluster:keyspace-level setting)
cluster "default", keyspace "other": the connector default (no matching override)
Predicate Pushdown Is Automatic!

SELECT * FROM cassandraTable WHERE clusteringKey > 100

The logical plan for that query is: DataFromC* -> Filter (clusteringKey > 100) -> Show. Catalyst hands the filter to the source, which adds the where clause to the CQL it generates: "clusteringKey > 100".

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
What can be pushed down?
1. Non-partition-key column predicates are pushed down only with =, >, <, >=, <=.
2. Primary key column predicates are pushed down only with = or IN.
3. If there are regular columns among the pushdown predicates, they must include at least one EQ
expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only
the last part of the partition key can be an IN predicate, and each partition column may have
only one predicate.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate
(including IN); all preceding column predicates must be EQ.
6. If there is only one clustering column predicate, it can be any non-IN predicate. Nothing is
pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates on the same column cannot be pushed down if any of them is an
equality or IN predicate.
What can be pushed down?
In short: if you could write it in CQL, it will get pushed down.
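As a rough way to check this (a sketch; the table test.kv with clustering column clusteringKey and a regular column value is hypothetical), compare the physical plans:

val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "test"))
  .load()

// CQL-expressible clustering-key predicate: a candidate for pushdown to CQL
df.filter("clusteringKey > 100").explain()

// An OR condition is not pushed down (rule 6), so Spark evaluates this filter itself
df.filter("clusteringKey > 100 OR value = 1").explain()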
What are we Pushing Down To?
CassandraTableScanRDD
All of the underlying code is the same as with sc.cassandraTable, so everything about Reading and Writing applies.

https://academy.datastax.com/
Watch me talk about this in the privacy of your own home!
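A minimal sketch of that RDD-level API (keyspace and table reused from the earlier example; sc.cassandraTable comes in via the connector's implicits):

import com.datastax.spark.connector._

// Returns a CassandraTableScanRDD[CassandraRow]
val rdd = sc.cassandraTable("test", "words")
rdd.take(5).foreach(println)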
How the Spark Cassandra Connector Reads Data

Spark RDDs represent a large amount of data, partitioned into chunks. In the diagram, partitions 1 through 9 of one RDD are spread across Node 1, Node 2, Node 3 and Node 4.
Cassandra Data is Distributed By Token Range
In the diagram, the token ring runs from 0 through 999 (marked at 0 and 500) and is split across Node 1 through Node 4. Without vnodes, each node owns a single contiguous range; with vnodes, each node owns many small ranges scattered around the ring.
The Connector Uses Information on the Node to Make Spark Partitions

spark.cassandra.input.split_size_in_mb        1
Reported density is 100 tokens per mb

In the walkthrough, Node 1 owns the token ranges 120-220, 300-500, 780-830 and 0-50. With a 1 mb split size and a reported density of 100 tokens per mb, the connector groups roughly 100 tokens into each Spark partition: a range that is too large, like 300-500, is cut into 300-400 and 400-500, while small ranges, like 780-830 and 0-50, are combined into a single partition. The result is four Spark partitions covering all of the node's ranges.
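A sketch of tuning this knob on the SparkConf (the contact point is an assumption; note the deck shows both the older split_size_in_mb spelling and the dotted split.size_in_mb spelling used in the configuration example earlier):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("split-size-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local node
  // Smaller splits -> more, smaller Spark partitions per token range
  .set("spark.cassandra.input.split.size_in_mb", "1")

val sc = new SparkContext(conf)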
Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size        50

For Spark partition 4 on Node 1 (token ranges 780-830 and 0-50), the connector issues one token-range query per range:

SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830

SELECT * FROM keyspace.table WHERE
token(pk) > 0 and token(pk) <= 50

The driver streams the results back a page at a time, 50 CQL rows per page, until each token range is exhausted.
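For intuition only, a hand-rolled sketch of one such range query using the DataStax Java driver's paging (contact point, keyspace test, table words and partition key word are all assumptions; the connector does this for you):

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

val stmt = session.prepare(
  "SELECT * FROM test.words WHERE token(word) > ? AND token(word) <= ?")

// setFetchSize plays the role of spark.cassandra.input.page.row.size
val rs = session.execute(stmt.bind(Long.box(780L), Long.box(830L)).setFetchSize(50))
rs.iterator().asScala.foreach(println) // pages of 50 rows are fetched as we iterate

cluster.close()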
How The Spark Cassandra Connector Writes Data

As with reads, Spark RDDs represent a large amount of data, partitioned into chunks spread across the nodes.
The Spark Cassandra Connector's saveToCassandra method can be called on almost all RDDs:
rdd.saveToCassandra("Keyspace","Table")
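A minimal end-to-end sketch (connection host, keyspace, table and schema are assumptions; saveToCassandra comes in via the connector's implicits):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("save-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local node
val sc = new SparkContext(conf)

// Hypothetical table: CREATE TABLE test.kv (pk int, ck int, value int, PRIMARY KEY (pk, ck))
val rows = sc.parallelize(Seq((1, 1, 1), (1, 2, 1), (2, 1, 1)))
rows.saveToCassandra("test", "kv", SomeColumns("pk", "ck", "value"))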
A Java Driver connection is made to the local node and a prepared statement is built for the target table.
Batches are built from the data in each Spark partition (rows such as 1,1,1; 1,2,1; 2,1,1; 3,8,1; ...), governed by these settings:

spark.cassandra.output.batch.grouping.key        partition
spark.cassandra.output.batch.size.rows           4
spark.cassandra.output.batch.buffer.size         3
spark.cassandra.output.concurrent.writes         2
spark.cassandra.output.throughput_mb_per_sec     5

By default these batches only contain CQL rows which share the same partition key, so 1,1,1 and 1,2,1 land in the PK=1 batch.

When an element is not part of an existing batch, a new batch is started (PK=2, PK=3, ...).

If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

If more than batch.buffer.size batches are currently being built, the largest batch is executed by the Java Driver to make room.

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the outstanding writes is acknowledged before sending the next.

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.
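A sketch of setting those write knobs on the SparkConf (values copied from the slide; the host is an assumption):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption
  .set("spark.cassandra.output.batch.grouping.key", "partition")
  .set("spark.cassandra.output.batch.size.rows", "4")
  .set("spark.cassandra.output.batch.buffer.size", "3")
  .set("spark.cassandra.output.concurrent.writes", "2")
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")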
Thanks for Coming and I Hope You Have a Great Time at C* Summit
http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/
Also ask these guys really hard questions: Jacek, Piotr, Alex
More Related Content

PDF
Apache Spark and DataStax Enablement
PDF
Lightning fast analytics with Spark and Cassandra
PDF
Spark cassandra connector.API, Best Practices and Use-Cases
PDF
Big data analytics with Spark & Cassandra
PDF
Zero to Streaming: Spark and Cassandra
PDF
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
PDF
Analytics with Cassandra & Spark
PPTX
Spark + Cassandra = Real Time Analytics on Operational Data
Apache Spark and DataStax Enablement
Lightning fast analytics with Spark and Cassandra
Spark cassandra connector.API, Best Practices and Use-Cases
Big data analytics with Spark & Cassandra
Zero to Streaming: Spark and Cassandra
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Analytics with Cassandra & Spark
Spark + Cassandra = Real Time Analytics on Operational Data

What's hot (20)

PPTX
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
PDF
Cassandra spark connector
PDF
Cassandra and Spark: Optimizing for Data Locality
PPTX
Real time data pipeline with spark streaming and cassandra with mesos
PDF
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
PDF
Spark and Cassandra 2 Fast 2 Furious
PDF
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
PDF
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
PDF
Spark Streaming with Cassandra
PDF
Spark And Cassandra: 2 Fast, 2 Furious
PDF
OLAP with Cassandra and Spark
PDF
Time series with Apache Cassandra - Long version
PPTX
Lightning fast analytics with Cassandra and Spark
PDF
Spark Cassandra Connector: Past, Present, and Future
PDF
DataEngConf SF16 - Spark SQL Workshop
PPTX
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
PDF
Lightning fast analytics with Spark and Cassandra
PPTX
BI, Reporting and Analytics on Apache Cassandra
PPTX
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
PPTX
Using Spark to Load Oracle Data into Cassandra
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Cassandra spark connector
Cassandra and Spark: Optimizing for Data Locality
Real time data pipeline with spark streaming and cassandra with mesos
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark and Cassandra 2 Fast 2 Furious
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark Streaming with Cassandra
Spark And Cassandra: 2 Fast, 2 Furious
OLAP with Cassandra and Spark
Time series with Apache Cassandra - Long version
Lightning fast analytics with Cassandra and Spark
Spark Cassandra Connector: Past, Present, and Future
DataEngConf SF16 - Spark SQL Workshop
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Lightning fast analytics with Spark and Cassandra
BI, Reporting and Analytics on Apache Cassandra
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Using Spark to Load Oracle Data into Cassandra
Ad

Viewers also liked (19)

PDF
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
PDF
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
PDF
Bulk Loading into Cassandra
PDF
Spark cassandra integration, theory and practice
PDF
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
PDF
Datastax enterprise presentation
PDF
Extending Word2Vec for Performance and Semi-Supervised Learning-(Michael Mala...
PDF
Bulk Loading Data into Cassandra
PDF
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
PDF
Structured streaming in Spark
PPTX
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
PDF
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
PDF
Data Engineering with Solr and Spark
PDF
DataStax: A deep look at the CQL WHERE clause
PDF
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
PDF
Introduction to PySpark
PDF
Java Performance Analysis on Linux with Flame Graphs
PDF
Rethinking Streaming Analytics For Scale
PDF
DataStax: Spark Cassandra Connector - Past, Present and Future
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Bulk Loading into Cassandra
Spark cassandra integration, theory and practice
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
Datastax enterprise presentation
Extending Word2Vec for Performance and Semi-Supervised Learning-(Michael Mala...
Bulk Loading Data into Cassandra
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Structured streaming in Spark
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Data Engineering with Solr and Spark
DataStax: A deep look at the CQL WHERE clause
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Introduction to PySpark
Java Performance Analysis on Linux with Flame Graphs
Rethinking Streaming Analytics For Scale
DataStax: Spark Cassandra Connector - Past, Present and Future
Ad

Similar to Spark Cassandra Connector Dataframes (20)

PPTX
Maximum Overdrive: Tuning the Spark Cassandra Connector
PDF
Apache cassandra and spark. you got the the lighter, let's start the fire
PPTX
Montreal User Group - Cloning Cassandra
PDF
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
PDF
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
PDF
Escape from Hadoop
PDF
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
PDF
Cassandra for Sysadmins
PDF
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
PDF
Breakthrough OLAP performance with Cassandra and Spark
ODP
Nyc summit intro_to_cassandra
PPTX
5 Ways to Use Spark to Enrich your Cassandra Environment
PDF
Spark Summit EU talk by Ross Lawley
PDF
How To Connect Spark To Your Own Datasource
PDF
Sparkstreaming
PDF
Cassandra and Spark
PDF
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
PDF
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
PDF
A Tale of Two APIs: Using Spark Streaming In Production
PDF
The Apache Cassandra ecosystem
Maximum Overdrive: Tuning the Spark Cassandra Connector
Apache cassandra and spark. you got the the lighter, let's start the fire
Montreal User Group - Cloning Cassandra
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Escape from Hadoop
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Cassandra for Sysadmins
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
Breakthrough OLAP performance with Cassandra and Spark
Nyc summit intro_to_cassandra
5 Ways to Use Spark to Enrich your Cassandra Environment
Spark Summit EU talk by Ross Lawley
How To Connect Spark To Your Own Datasource
Sparkstreaming
Cassandra and Spark
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
A Tale of Two APIs: Using Spark Streaming In Production
The Apache Cassandra ecosystem

Recently uploaded (20)

PPTX
ai tools demonstartion for schools and inter college
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PPTX
ISO 45001 Occupational Health and Safety Management System
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PDF
Design an Analysis of Algorithms II-SECS-1021-03
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PDF
top salesforce developer skills in 2025.pdf
PDF
AI in Product Development-omnex systems
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
PDF
How Creative Agencies Leverage Project Management Software.pdf
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
System and Network Administration Chapter 2
PPTX
L1 - Introduction to python Backend.pptx
PPTX
ManageIQ - Sprint 268 Review - Slide Deck
PPTX
Transform Your Business with a Software ERP System
PPTX
history of c programming in notes for students .pptx
ai tools demonstartion for schools and inter college
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
ISO 45001 Occupational Health and Safety Management System
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
Design an Analysis of Algorithms II-SECS-1021-03
Design an Analysis of Algorithms I-SECS-1021-03
top salesforce developer skills in 2025.pdf
AI in Product Development-omnex systems
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
How Creative Agencies Leverage Project Management Software.pdf
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PTS Company Brochure 2025 (1).pdf.......
Adobe Illustrator 28.6 Crack My Vision of Vector Design
System and Network Administration Chapter 2
L1 - Introduction to python Backend.pptx
ManageIQ - Sprint 268 Review - Slide Deck
Transform Your Business with a Software ERP System
history of c programming in notes for students .pptx

Spark Cassandra Connector Dataframes

  • 1. Cassandra And Spark Dataframes Russell Spitzer Software Engineer @ Datastax
  • 2. Cassandra And Spark Dataframes
  • 3. Cassandra And Spark Dataframes
  • 4. Cassandra And Spark Dataframes
  • 5. Cassandra And Spark Dataframes
  • 6. Tungsten Gives Dataframes OffHeap Power! Can compare memory off-heap and bitwise! Code generation!
  • 7. The Core is the Cassandra Source https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra /** * Implements [[BaseRelation]]]], [[InsertableRelation]]]] and [[PrunedFilteredScan]]]] * It inserts data to and scans Cassandra table. If filterPushdown is true, it pushs down * some filters to CQL * */ DataFrame source org.apache.spark.sql.cassandra
  • 8. The Core is the Cassandra Source https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra /** * Implements [[BaseRelation]]]], [[InsertableRelation]]]] and [[PrunedFilteredScan]]]] * It inserts data to and scans Cassandra table. If filterPushdown is true, it pushs down * some filters to CQL * */ DataFrame CassandraSourceRelation CassandraTableScanRDDConfiguration
  • 9. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load()
  • 10. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load() Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 11. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 12. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "ClusterOne" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 13. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "test" , "cluster" -> "default" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32
  • 14. Configuration Can Be Done on a Per Source Level clusterName:keyspaceName/propertyName. Example Changing Cluster/Keyspace Level Properties val conf = new SparkConf() .set("ClusterOne/spark.cassandra.input.split.size_in_mb","32") .set("default:test/spark.cassandra.input.split.size_in_mb","128") val lastdf = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "words", "keyspace" -> "other" , "cluster" -> "default" ) ).load() Namespace: default Keyspace: test spark.cassandra.input.split.size_in_mb=128 Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32 Connector Default
  • 15. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100
  • 16. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* Filter clusteringKey > 100 Show
  • 17. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* Filter clusteringKey > 100 Show Catalyst
  • 18. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* Filter clusteringKey > 100 Show Catalyst https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
  • 19. Predicate Pushdown Is Automatic! Select * From cassandraTable where clusteringKey > 100 DataFrame DataFromC* AND add where clause to CQL "clusteringKey > 100" Show Catalyst https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra- connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
  • 20. What can be pushed down? 1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate 2. Only push down primary key column predicates with = or IN predicate. 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates. 4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed. 5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. 6. If there is only one cluster column predicate, the predicates could be any non-IN predicate. There is no pushdown predicates if there is any OR condition or NOT IN condition. 7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.
  • 21. What can be pushed down? If you could write in CQL it will get pushed down.
  • 22. What are we Pushing Down To? CassandraTableScanRDD All of the underlying code is the same as with sc.cassandraTable so everything with Reading and Writing
 applies
  • 23. What are we Pushing Down To? CassandraTableScanRDD All of the underlying code is the same as with sc.cassandraTable so everything with Reading and Writing
 applies https://academy.datastax.com/
 Watch me talk about this in the privacy of your own home!
  • 24. How the Spark Cassandra Connector Reads Data
  • 25. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks RDD 1 2 3 4 5 6 7 8 9Node 2 Node 1 Node 3 Node 4
  • 26. Node 2 Node 1 Spark RDDs Represent a Large Amount of Data Partitioned into Chunks RDD 2 346 7 8 9 Node 3 Node 4 1 5
  • 27. Node 2 Node 1 RDD 2 346 7 8 9 Node 3 Node 4 1 5 Spark RDDs Represent a Large Amount of Data Partitioned into Chunks
  • 28. Cassandra Data is Distributed By Token Range
  • 29. Cassandra Data is Distributed By Token Range 0 500
  • 30. Cassandra Data is Distributed By Token Range 0 500 999
  • 31. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4
  • 32. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4 Without vnodes
  • 33. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4 With vnodes
  • 34. Node 1 120-220 300-500 780-830 0-50 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb The Connector Uses Information on the Node to Make 
 Spark Partitions
  • 35. Node 1 120-220 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 1 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 36. 1 Node 1 120-220 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 37. 2 1 Node 1 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 38. 2 1 Node 1 300-500 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 39. 2 1 Node 1 300-400 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 400-500 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 40. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 400-500 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 41. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 400-500 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 42. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 400-500 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 43. 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 44. 4 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 45. 4 21 Node 1 0-50 The Connector Uses Information on the Node to Make 
 Spark Partitions 780-830 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 46. 421 Node 1 The Connector Uses Information on the Node to Make 
 Spark Partitions 3 spark.cassandra.input.split_size_in_mb  1 Reported  density  is  100  tokens  per  mb
  • 47. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50780-830 Node 1
  • 48. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50
  • 49. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50
  • 50. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows
  • 51. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows
  • 52. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows 50 CQL Rows
  • 53. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows
  • 54. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows
  • 55. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows
  • 56. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 57. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 58. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 59. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 60. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 61. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 62. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 63. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 64. 4 spark.cassandra.input.page.row.size 50 Data is Retrieved Using the DataStax Java Driver 0-50 780-830 Node 1 SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50 50 CQL Rows50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows 50 CQL Rows
  • 65. How The Spark Cassandra Connector Writes Data
  • 66. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks RDD 1 2 3 4 5 6 7 8 9Node 2 Node 1 Node 3 Node 4
  • 67. Node 2 Node 1 Spark RDDs Represent a Large Amount of Data Partitioned into Chunks RDD 2 346 7 8 9 Node 3 Node 4 1 5
  • 68. Node 2 Node 1 RDD 2 346 7 8 9 Node 3 Node 4 1 5 The Spark Cassandra Connector saveToCassandra method can be called on almost all RDDs rdd.saveToCassandra("Keyspace","Table")
  • 69. Node 11 A Java Driver connection is made to the local node and a prepared statement is built for the target table Java Driver
  • 70. Node 11 Batches are built from data in Spark partitions Java Driver 1,1,1 1,2,1 2,1,1 3,8,1 3,2,1 3,4,1 3,5,1 3,1,1 1,4,1 5,4,1 2,4,1 8,4,1 9,4,1 3,9,1
  • 71. Node 11 By default these batches only contain CQL Rows which share the same partition key Java Driver 1,1,1 1,2,1 2,1,1 3,8,1 3,2,1 3,4,1 3,5,1 3,1,1 1,4,1 5,4,1 2,4,1 8,4,1 9,4,1 11,4, spark.cassandra.output.batch.grouping.key  partition
 spark.cassandra.output.batch.size.rows                4   spark.cassandra.output.batch.buffer.size            3   spark.cassandra.output.concurrent.writes            2
 spark.cassandra.output.throughput_mb_per_sec    5 3,9,1
  • 72. Node 11 Java Driver 1,1,1 1,2,1 2,1,1 3,8,1 3,2,1 3,4,1 3,5,1 3,1,1 1,4,1 5,4,1 2,4,1 8,4,1 9,4,1 11,4, spark.cassandra.output.batch.grouping.key  partition
 spark.cassandra.output.batch.size.rows                4   spark.cassandra.output.batch.buffer.size            3   spark.cassandra.output.concurrent.writes            2
 spark.cassandra.output.throughput_mb_per_sec    5 3,9,1 By default these batches only contain CQL Rows which share the same partition key PK=1
  • 73–75. When an element is not part of an existing batch, a new batch is started (the PK=1 batch is joined by a PK=2 batch).
  • 76–79. If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver: the PK=3 batch fills to four rows and is written, and a fresh PK=3 batch starts for later rows.
  • 80–83. If more than batch.buffer.size batches are being built at once, the largest batch is executed by the Java Driver: when a PK=5 batch becomes the fourth open batch, the largest one (PK=1) is flushed. A simplified sketch of this grouping logic follows below.
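  A simplified sketch of the grouping rules from the preceding slides; the real connector's batch builder is asynchronous and more involved, and Row and execute here are stand-ins:
  import scala.collection.mutable
  case class Row(pk: Int, ck: Int, v: Int)
  val batchSizeRows   = 4                 // spark.cassandra.output.batch.size.rows
  val batchBufferSize = 3                 // spark.cassandra.output.batch.buffer.size
  val open = mutable.Map.empty[Int, mutable.Buffer[Row]] // partition key -> open batch
  def execute(batch: Seq[Row]): Unit =
    println(s"writing batch of ${batch.size}: $batch")
  def add(row: Row): Unit = {
    // Rows sharing a partition key join the same batch; otherwise a new batch starts
    val batch = open.getOrElseUpdate(row.pk, mutable.Buffer.empty[Row])
    batch += row
    if (batch.size >= batchSizeRows)       // batch is full: execute it
      execute(open.remove(row.pk).get.toSeq)
    else if (open.size > batchBufferSize)  // too many open batches: flush the largest
      execute(open.remove(open.maxBy(_._2.size)._1).get.toSeq)
  }
  Seq(Row(1,1,1), Row(1,2,1), Row(2,1,1), Row(3,8,1), Row(3,2,1),
      Row(3,4,1), Row(3,5,1), Row(5,4,1), Row(8,4,1)).foreach(add)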
  • 84–89. If more batches are being executed by the Java Driver than concurrent.writes allows, we wait until one of the in-flight requests completes; each acknowledged write frees a slot for the next batch.
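  A hedged sketch of that concurrency cap using a semaphore; the real connector tracks driver futures, so executeBatch here is a stand-in:
  import java.util.concurrent.Semaphore
  val concurrentWrites = new Semaphore(2) // spark.cassandra.output.concurrent.writes = 2
  def writeBatch(batch: Seq[String], executeBatch: Seq[String] => Unit): Unit = {
    concurrentWrites.acquire()            // blocks while two writes are in flight
    try executeBatch(batch)               // stand-in for the driver's async execute
    finally concurrentWrites.release()    // in the real flow the permit is released
                                          // from the write's completion callback
  }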
  • 90–95. The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that many megabytes in the past second; once the trailing-second total drops below the limit, writing resumes.
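  A hedged sketch of that throttle as a one-second window over bytes written; the connector implements this with a proper rate limiter, so this is only an illustration:
  val throughputBytesPerSec = 5L * 1024 * 1024  // throughput_mb_per_sec = 5
  var windowStart   = System.nanoTime()
  var bytesInWindow = 0L
  def throttle(batchBytes: Long): Unit = {
    val elapsed = System.nanoTime() - windowStart
    if (elapsed >= 1000000000L) {               // a fresh one-second window
      windowStart = System.nanoTime(); bytesInWindow = 0L
    } else if (bytesInWindow + batchBytes > throughputBytesPerSec) {
      Thread.sleep((1000000000L - elapsed) / 1000000L) // block until the window ends
      windowStart = System.nanoTime(); bytesInWindow = 0L
    }
    bytesInWindow += batchBytes
  }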
  • 96. Thanks for coming, and I hope you have a great time at C* Summit! http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/ Also ask these guys really hard questions: Jacek, Piotr, and Alex.