Best Practices for Running
Spark with Scylla
Eyal Gutkind - Head of Solutions Architects
Speaker
Eyal Gutkind is head of solutions architects at Scylla. Prior to
Scylla, Eyal held product management roles at Mirantis and
DataStax. Prior to DataStax, Eyal spent 12 years with Mellanox
Technologies in various engineering management and product
marketing roles. Eyal holds a BSc degree in Electrical and
Computer Engineering from Ben Gurion University, Israel, and an
MBA from the Fuqua School of Business at Duke University, North
Carolina.
Analytics
Scylla token architecture
source: http://docs.scylladb.com/architecture/ringarchitecture/
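As a rough illustration of the ring idea: each node owns one or more tokens, a partition key is hashed to a token, and the key belongs to the node holding the next token on the ring. The sketch below is a minimal model under stated assumptions. `token_for` uses a SHA-256 based stand-in hash, not the Murmur3 partitioner a real cluster uses by default, and the `TokenRing` class, node names, and token values are hypothetical. Replication is not modeled.

```python
import bisect
import hashlib
from typing import Dict, List


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: SHA-256 is a stand-in here; it is NOT the Murmur3 hash
    that Scylla/Cassandra use by default.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Hypothetical ring: each node is assigned one or more tokens."""

    def __init__(self, node_tokens: Dict[str, List[int]]) -> None:
        # Flatten {node: [tokens]} into a list of (token, node) sorted by token.
        self._ring = sorted(
            (token, node) for node, tokens in node_tokens.items() for token in tokens
        )
        self._tokens = [token for token, _ in self._ring]

    def owner(self, token: int) -> str:
        # The owner holds the first ring token >= the key's token;
        # past the largest token, wrap around to the start of the ring.
        index = bisect.bisect_left(self._tokens, token)
        return self._ring[index % len(self._ring)][1]


ring = TokenRing({"node1": [-100, 50], "node2": [0, 100]})
print(ring.owner(25))                  # node1 (next token on the ring is 50)
print(ring.owner(token_for("Pkey1")))  # owner of a hashed partition key
```

Because a Spark partition is filled from token ranges, the way keys spread over the ring affects how evenly work spreads over Spark tasks.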
Spark and Spark partitions
source: https://spark.apache.org/docs/latest/cluster-overview.html
Spark and Spark partitions
Node 1: RDD1 Partition 1, RDD2 Partition 4
Node 2: RDD1 Partition 4, RDD2 Partition 2
Node 3: RDD1 Partition 2, RDD2 Partition 3
Node 4: RDD1 Partition 3, RDD2 Partition 1
Scylla to Spark, partition considerations
RDD 1 Partition 3
Pkey1 Col1 Col2 Col3
Pkey2 Col1 Col2 Col3
Pkey7342 Col1 Col2 Col3
The Cassandra-Spark connector
https://github.com/datastax/spark-cassandra-connector
▪ Provides Spark Context to data stored in Scylla/Cassandra
▪ Batch writes
▪ Read Scylla/Cassandra partitions to Spark Partitions
▪ Connection management between Scylla and Spark driver and executors
▪ Utilizes the Cassandra Java driver
When Spark writes to Scylla
output.batch.grouping.buffer.size
output.batch.size.bytes
output.concurrent.writes
output.batch.grouping.key
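To make the interplay of these settings more concrete, here is a toy model of write batching, assuming rows are grouped by partition key, a batch is sent once it reaches a byte limit, and only a bounded number of batches is held open at once. It is a simplified sketch and not the connector's implementation. The function `group_into_batches`, the `Row` type, and the argument values are hypothetical; the arguments only loosely mirror `output.batch.size.bytes` and `output.batch.grouping.buffer.size`.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, bytes]  # (partition key, serialized row payload)


def group_into_batches(
    rows: Iterable[Row], batch_size_bytes: int, buffer_size: int
) -> Iterator[List[Row]]:
    """Toy model: one open batch per partition key, flushed when full."""
    open_batches: Dict[str, List[Row]] = {}
    open_bytes: Dict[str, int] = {}

    for key, payload in rows:
        open_batches.setdefault(key, []).append((key, payload))
        open_bytes[key] = open_bytes.get(key, 0) + len(payload)

        if open_bytes[key] >= batch_size_bytes:
            # This key's batch reached the byte limit: send it.
            del open_bytes[key]
            yield open_batches.pop(key)
        elif len(open_batches) > buffer_size:
            # Too many open batches: send the largest one to free a slot.
            largest = max(open_bytes, key=lambda k: open_bytes[k])
            del open_bytes[largest]
            yield open_batches.pop(largest)

    # End of input: send whatever is still open.
    for key in list(open_batches):
        yield open_batches.pop(key)


rows = [("a", b"x" * 600), ("b", b"y" * 100), ("a", b"x" * 600), ("b", b"y" * 100)]
for batch in group_into_batches(rows, batch_size_bytes=1024, buffer_size=10):
    print(batch[0][0], len(batch))
```

In this model every batch contains rows of a single partition key, so each batch lands on one replica set. A small buffer forces early flushes of half-empty batches, while a large one holds more rows in memory per task.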
When Spark reads from Scylla
input.split.size_in_mb
Don’t forget data is compressed on disk!
Scylla paging capabilities will have an impact!
input.fetch.size_in_rows
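A back-of-the-envelope calculation shows why the split size and the compression reminder matter. The sketch below is plain arithmetic, not the connector's own size estimator. The function names and the sample numbers (3200 MB on disk, a 0.5 compression ratio, 10,000 rows) are hypothetical.

```python
import math


def uncompressed_size_mb(on_disk_mb: float, compression_ratio: float) -> float:
    """compression_ratio = compressed size / uncompressed size (0 < ratio <= 1)."""
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    return on_disk_mb / compression_ratio


def estimated_spark_partitions(data_mb: float, split_size_mb: float) -> int:
    """Roughly one Spark partition per split_size_mb of data, at least one."""
    if split_size_mb <= 0:
        raise ValueError("split_size_mb must be positive")
    return max(1, math.ceil(data_mb / split_size_mb))


def pages_needed(row_count: int, fetch_size_in_rows: int) -> int:
    """Number of paged round trips needed to read row_count rows."""
    if fetch_size_in_rows <= 0:
        raise ValueError("fetch_size_in_rows must be positive")
    return math.ceil(row_count / fetch_size_in_rows)


data_mb = uncompressed_size_mb(3200, 0.5)       # 6400.0 MB once decompressed
print(estimated_spark_partitions(data_mb, 64))  # 100
print(estimated_spark_partitions(data_mb, 1))   # 6400
print(pages_needed(10_000, 1000))               # 10
```

Sizing from the on-disk figure alone would undercount the data each Spark task has to hold, and a smaller fetch size means more paged round trips per task.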
To collocate or not to collocate?
Fine tuning Spark performance with Scylla
▪ Increase default Spark parallelism (number of cores in the Spark local machine deployment)
▪ Reduce Spark split size (64 -> 1)
▪ connection.connections_per_executor_max (# of cores or more)
▪ output.concurrent.writes (default is 5)
▪ concurrent.reads (default is 512)
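The tuning points above can be collected into `spark-submit --conf` flags. This is a minimal sketch, assuming the `spark.cassandra.` prefix used by the connector and the property names as written on the slide; newer connector releases renamed some of them, so check the reference for your version. The helper names, the `parallelism_factor` multiplier, and the core counts are hypothetical starting points, not recommendations.

```python
from typing import Dict, List


def tuning_conf(
    total_cores: int, cores_per_executor: int, parallelism_factor: int = 2
) -> Dict[str, str]:
    """Build a settings map from the tuning points on the slide."""
    if min(total_cores, cores_per_executor, parallelism_factor) < 1:
        raise ValueError("all arguments must be positive")
    return {
        # Raise parallelism above the plain core count (factor is an assumption).
        "spark.default.parallelism": str(total_cores * parallelism_factor),
        # Split size reduced from 64 to 1, as on the slide.
        "spark.cassandra.input.split.size_in_mb": "1",
        # At least one connection per executor core.
        "spark.cassandra.connection.connections_per_executor_max": str(cores_per_executor),
        # Defaults quoted on the slide; adjust after measuring.
        "spark.cassandra.output.concurrent.writes": "5",
        "spark.cassandra.concurrent.reads": "512",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a settings map as spark-submit arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(tuning_conf(total_cores=32, cores_per_executor=8))))
```

Keeping the settings in one map makes it easier to change a single value per test run and compare results.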
Conclusion
▪ Scylla enables analytics on top of transactional data
▪ Performance tuning is required for certain workloads
▪ Resource management is key to the stability of your deployment
Q&A
Stay in touch
Learn more
eyal@scylladb.com
@gutkinde
scylladb.com/blog
scylladb-users.slack.com


Scylla Summit 2018: Best Practices for Running Spark with Scylla
