Scylla Summit 2018: Best Practices for Running Spark with Scylla

Best Practices for Running Spark with Scylla
Eyal Gutkind - Head of Solutions Architects

Speaker
Eyal Gutkind is Head of Solutions Architects at Scylla. Prior to
Scylla, Eyal held product management roles at Mirantis and
DataStax. Prior to DataStax, Eyal spent 12 years with Mellanox
Technologies in various engineering management and product
marketing roles. Eyal holds a BSc in Electrical and
Computer Engineering from Ben Gurion University, Israel, and
an MBA from the Fuqua School of Business at Duke University,
North Carolina.

Scylla token architecture
source: http://docs.scylladb.com/architecture/ringarchitecture/

Spark and Spark partitions
source: https://spark.apache.org/docs/latest/cluster-overview.html

Spark and Spark partitions
Node 1: RDD1 Partition 1, RDD2 Partition 4
Node 2: RDD1 Partition 4, RDD2 Partition 2
Node 3: RDD1 Partition 2, RDD2 Partition 3
Node 4: RDD1 Partition 3, RDD2 Partition 1

Scylla to Spark, partition considerations
RDD 1 Partition 3
Pkey1 Col1 Col2 Col3
Pkey2 Col1 Col2 Col3
...
Pkey7342 Col1 Col2 Col3
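
The picture above suggests that one Spark partition usually holds many Scylla partitions, since ranges of the token ring are assigned to Spark partitions. The sketch below is a simplification under stated assumptions, not the connector's actual algorithm: it cuts the Murmur3 token range into equal-width slices and groups partition keys by slice. The function names and the sample tokens are hypothetical.

```python
# Murmur3 partitioner token range used by Scylla/Cassandra.
TOKEN_MIN = -2**63
TOKEN_MAX = 2**63 - 1


def spark_partition_for_token(token, num_spark_partitions):
    """Return the index of the equal-width ring slice that contains `token`.

    Assumption: slices are equal in token width. A real connector may
    size its splits by estimated data volume instead.
    """
    if num_spark_partitions < 1:
        raise ValueError("num_spark_partitions must be at least 1")
    if not TOKEN_MIN <= token <= TOKEN_MAX:
        raise ValueError("token is outside the Murmur3 range")
    ring_size = TOKEN_MAX - TOKEN_MIN + 1  # 2**64 tokens in total
    return (token - TOKEN_MIN) * num_spark_partitions // ring_size


def group_by_spark_partition(tokens_by_key, num_spark_partitions):
    """Group partition keys by the Spark partition their token falls into."""
    groups = {}
    for key, token in tokens_by_key.items():
        index = spark_partition_for_token(token, num_spark_partitions)
        groups.setdefault(index, []).append(key)
    return groups


# Hypothetical tokens; real tokens come from hashing the partition key.
sample = {"Pkey1": TOKEN_MIN, "Pkey2": -1, "Pkey7342": 0}
print(group_by_spark_partition(sample, 4))
```

With fewer Spark partitions, more Scylla partitions land in each group, which is the trade-off the read and write settings later in the deck control.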

The Cassandra-Spark connector
https://github.com/datastax/spark-cassandra-connector
▪ Gives the Spark context access to data stored in Scylla/Cassandra
▪ Batches writes
▪ Reads Scylla/Cassandra partitions into Spark partitions
▪ Manages connections between Scylla and the Spark driver and executors
▪ Utilizes the Cassandra Java driver

When Spark writes to Scylla
output.batch.grouping.buffer.size
output.batch.size.bytes
output.concurrent.writes
output.batch.grouping.key
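
A minimal sketch of how the four write settings above might be collected and passed to Spark. It assumes the settings take the `spark.cassandra.` prefix and that the defaults are those I recall from the connector 2.x reference (1000, 1024, 5, `partition`); check both against the connector version you run. The helper names `write_conf` and `as_submit_args` are hypothetical.

```python
# Assumed defaults (connector 2.x reference); verify for your version.
WRITE_DEFAULTS = {
    "output.batch.grouping.buffer.size": 1000,
    "output.batch.size.bytes": 1024,
    "output.concurrent.writes": 5,
    "output.batch.grouping.key": "partition",
}


def write_conf(overrides=None):
    """Merge overrides into the assumed defaults, rejecting unknown names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(WRITE_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown write setting(s): {sorted(unknown)}")
    merged = {**WRITE_DEFAULTS, **overrides}
    # Spark configuration values are passed around as strings.
    return {f"spark.cassandra.{name}": str(value) for name, value in merged.items()}


def as_submit_args(conf):
    """Render a settings dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(as_submit_args(write_conf({"output.concurrent.writes": 16}))))
```

Rejecting unknown names is a deliberate choice here: a misspelled setting would otherwise be silently ignored and the job would run with defaults.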

When Spark reads from Scylla
input.split.size_in_mb
Don’t forget data is compressed on disk!
Scylla paging capabilities will have an impact!
input.fetch.size_in_rows
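
To make the compression warning concrete, here is a back-of-the-envelope sketch. It assumes the number of Spark partitions is roughly the table size divided by `input.split.size_in_mb`, and that the size used for splitting is the compressed on-disk size. The compression ratio is a made-up input, not a measured value, and `estimate_read_layout` is a hypothetical helper.

```python
import math


def estimate_read_layout(on_disk_mb, split_size_mb=64, compression_ratio=1.0):
    """Estimate the Spark partition count and uncompressed MB per partition.

    on_disk_mb:        compressed table size on disk (assumed input)
    split_size_mb:     value of input.split.size_in_mb
    compression_ratio: uncompressed size / compressed size (assumed input)
    """
    if on_disk_mb <= 0 or split_size_mb <= 0 or compression_ratio <= 0:
        raise ValueError("all arguments must be positive")
    spark_partitions = math.ceil(on_disk_mb / split_size_mb)
    uncompressed_mb_per_partition = on_disk_mb * compression_ratio / spark_partitions
    return spark_partitions, uncompressed_mb_per_partition


# A 6400 MB table with an assumed 2x compression ratio.
print(estimate_read_layout(6400, split_size_mb=64, compression_ratio=2.0))
print(estimate_read_layout(6400, split_size_mb=1, compression_ratio=2.0))
```

Under these assumptions a 64 MB split expands to about twice that in executor memory, while a 1 MB split yields many more, much smaller tasks.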

To collocate or not to collocate?

Fine tuning Spark performance with Scylla
▪ Increase default Spark parallelism (by default, the number of cores in a Spark local machine deployment)
▪ Reduce Spark split size (64 -> 1)
▪ connection.connections_per_executor_max (number of cores or more)
▪ output.concurrent.writes (default is 5)
▪ concurrent.reads (default is 512)
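
The tuning points on this slide could be gathered into one set of overrides, roughly as follows. This is a sketch under assumptions: the property names are the slide's names with the `spark.cassandra.` prefix as used by the connector 2.x line (later versions renamed some of them), `tasks_per_core` is a hypothetical knob for raising parallelism above the core count, and `tuning_overrides` is not part of any library.

```python
def tuning_overrides(cores_per_executor, total_cores, tasks_per_core=2):
    """Build Spark settings that follow the tuning points on the slide.

    tasks_per_core is an assumed multiplier; Spark's tuning guide
    suggests a few tasks per CPU core as a starting point.
    """
    if cores_per_executor < 1 or total_cores < cores_per_executor:
        raise ValueError("need at least one core per executor and total_cores >= cores_per_executor")
    if tasks_per_core < 1:
        raise ValueError("tasks_per_core must be at least 1")
    return {
        # Raise parallelism above the default (core count in local mode).
        "spark.default.parallelism": str(total_cores * tasks_per_core),
        # Reduced split size, 64 -> 1 as on the slide.
        "spark.cassandra.input.split.size_in_mb": "1",
        # At least one connection per executor core.
        "spark.cassandra.connection.connections_per_executor_max": str(cores_per_executor),
        # Defaults quoted on the slide; change only after measuring.
        "spark.cassandra.output.concurrent.writes": "5",
        "spark.cassandra.concurrent.reads": "512",
    }


for key, value in sorted(tuning_overrides(4, 16).items()):
    print(f"{key}={value}")
```

The two values left at their quoted defaults are listed explicitly so that they are visible when comparing runs.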

Conclusion
▪ Scylla enables analytics on top of transactional data
▪ Performance tuning is required for certain workloads
▪ Resource management is key to the stability of your deployment

Q&A
Stay in touch
eyal@scylladb.com
@gutkinde
Learn more
scylladb.com/blog
scylladb-users.slack.com
