Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla
Alexys Jacob - CTO, Numberly
Moderator - Peter Corless, ScyllaDB
Peter has a 29-year career in Silicon Valley that
threads through stints at e2f, Aerospike, Cisco and
Apple. He is passionate about technology, customer
success, engendering community, and social media. In
his off hours he enjoys playing 4X strategy games.
Twitter: @petercorless
About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
+ Learn more at scylladb.com
Presenter - Alexys Jacob, Numberly
whoami
@ultrabug
1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and destinations.
➔ For this we use ID matching tables.
ID matching tables
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable by partner
Queried AND updated all the time!
➔ High read AND write workload
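The three steps can be sketched in plain Python. This is only an illustration: the email addresses come from the retargeting example, while the mapping of each email to a cookie ID is an assumption made for the sketch, and the dictionary stands in for the real ID matching table.

```python
from typing import Dict, Iterable, List, Tuple

# Step 1 stand-in: the reference population, keyed by email.
REFERENCE_POPULATION: List[str] = [
    "generous@coconut.fr",
    "isupportu@lab.com",
    "wiki4ever@wp.eu",
    "openinternet@free.fr",
]

# Stand-in for the ID matching table (email -> partner cookie ID).
# Which email maps to which cookie ID is assumed for illustration.
ID_MATCHING: Dict[str, int] = {
    "generous@coconut.fr": 123,
    "isupportu@lab.com": 297,
    "openinternet@free.fr": 896,
}


def match_population(population: Iterable[str],
                     id_matching: Dict[str, int]) -> List[Tuple[str, int]]:
    """Step 2: inner join, keeping only people the partner can identify."""
    return [(email, id_matching[email]) for email in population if email in id_matching]


# Step 3: the matched population is what the partner can use.
matched = match_population(REFERENCE_POPULATION, ID_MATCHING)
print(matched)
```

People without an entry in the matching table simply drop out of the result, which is why the table has to be kept up to date while it is being queried.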
Real life example: retargeting
From a database (email) to a web banner (cookie)
[Diagram: SELECT ➔ MATCH ➔ ACTIVATE]
▪ SELECT: previous donors (generous@coconut.fr, isupportu@lab.com, wiki4ever@wp.eu, openinternet@free.fr)
▪ MATCH: ID matching table (Cookie id = 123, Cookie id = 297, ?, Cookie id = 896)
▪ ACTIVATE: Ad Exchange (AppNexus, ..., Google); user with cookie id 123 on https://kitty.eu
Current implementation(s)
[Diagram: Events, Message queues, HDFS, Real time Programs, Batch Calculation, MongoDB, Hive; split into a Batch pipeline and a Real time pipeline]
Drawbacks & pitfalls
[Diagram: Events, Message queues, HDFS, Real time Programs, Batch Calculation, MongoDB, Hive; split into a Batch pipeline and a Real time pipeline]
Scylla?
Future implementation using Scylla?
[Diagram: Events, Message queues, Real time Programs, Batch Calculation, Scylla; Batch pipeline and Real time pipeline]
Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ Compete with our production? Scylla is in!
Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with the given partner ID over the last N months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions are also to be answered!
➔ Denormalization
➔ Prototype with your language of choice!
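A prototype of the denormalized model might look like the sketch below. It assumes one lookup structure per query direction; the names (`cookies_by_partner`, `partners_by_cookie`, `record_match`) are hypothetical, and in-memory dictionaries stand in for the tables.

```python
from collections import defaultdict
from datetime import date
from typing import DefaultDict, List, Tuple

# One "table" per query direction, each keyed by the ID we look up with.
cookies_by_partner: DefaultDict[str, List[Tuple[date, str]]] = defaultdict(list)
partners_by_cookie: DefaultDict[str, List[Tuple[date, str]]] = defaultdict(list)


def record_match(partner_id: str, cookie_id: str, seen_on: date) -> None:
    """Write the same fact twice so that each question is a single-key lookup."""
    cookies_by_partner[partner_id].append((seen_on, cookie_id))
    partners_by_cookie[cookie_id].append((seen_on, partner_id))


def cookies_since(partner_id: str, since: date) -> List[str]:
    """Question 1: all cookie IDs seen for a partner ID since a given date."""
    return [cookie for seen_on, cookie in cookies_by_partner[partner_id] if seen_on >= since]


def last_cookie(partner_id: str) -> Tuple[date, str]:
    """Question 2: the most recent (date, cookie ID) for a partner ID."""
    return max(cookies_by_partner[partner_id])


record_match("p1", "c1", date(2018, 1, 5))
record_match("p1", "c2", date(2018, 6, 1))
record_match("p2", "c2", date(2018, 3, 9))
```

Writing small tests against such a prototype is a cheap way to check that every question, including the reverse ones, can be answered from a single key before committing to a schema.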
Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the partition in the sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the partition in the sstable
➔ Reduced read latency!
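As a sketch of the tip, the helper below renders a CQL table definition with an explicit clustering order, plus the matching "latest row" query. The keyspace, table and column names are hypothetical; the statements are only built as strings here and would be passed to a driver session in a real setup.

```python
def create_table_cql(keyspace: str, table: str, order: str = "DESC") -> str:
    """Render a CREATE TABLE statement with an explicit clustering order."""
    if order not in ("ASC", "DESC"):
        raise ValueError("order must be 'ASC' or 'DESC'")
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (\n"
        "    partner_id text,\n"
        "    match_date timestamp,\n"
        "    cookie_id text,\n"
        "    PRIMARY KEY ((partner_id), match_date)\n"
        f") WITH CLUSTERING ORDER BY (match_date {order})"
    )


def last_cookie_cql(keyspace: str, table: str) -> str:
    """With DESC ordering, the newest row is the first one in the partition."""
    return (
        f"SELECT cookie_id, match_date FROM {keyspace}.{table} "
        "WHERE partner_id = ? LIMIT 1"
    )


print(create_table_cql("idmatching", "cookie_by_partner"))
print(last_cookie_cql("idmatching", "cookie_by_partner"))
```

With the DESC order, `LIMIT 1` returns the latest match without scanning the rest of the partition.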
scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
Spark 2 + Hive: reference metrics
[Diagram: Hive (population) + Hive (ID matching) ➔ Partitions count]
Results:
▪ idle cluster: 2 minutes, 15 seconds
▪ normal cluster: 4 minutes
▪ overloaded cluster: 15 minutes
Let’s use Scylla!
Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
Spark 2 + Hive + Scylla
[Diagram: Hive (population) + Scylla (ID matching) ➔ Partitions count]
Spark 2 / Scala test workload
DataStax’s spark-cassandra-connector joinWithCassandraTable
▪ spark-cassandra-connector-2.0.1-s_2.11.jar
▪ Java 7
Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per second
▪ spark.cassandra.input.reads_per_sec=6666
Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
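One way to apply the settings from both tuning slides is to pass them as `--conf` flags to `spark-submit`. The sketch below only builds the argument list; the values are the ones shown on the slides, tuned for this cluster rather than general recommendations, and the jar name is hypothetical.

```python
from typing import Dict, List

# Values copied from the two tuning slides.
SPARK_CONF: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "30",
    "spark.cassandra.input.split.size_in_mb": "1",
    "spark.cassandra.input.reads_per_sec": "6666",
    "spark.cassandra.connection.connections_per_executor_max": "100",
    "spark.cassandra.connection.timeout_ms": "150000",
    "spark.cassandra.read.timeout_ms": "150000",
}


def spark_submit_args(conf: Dict[str, str], app_jar: str) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


# "id-matching-join.jar" is a hypothetical application jar name.
print(" ".join(spark_submit_args(SPARK_CONF, "id-matching-join.jar")))
```

Keeping the settings in one dictionary makes it easy to change a single value between test runs and compare results.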
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
Spark 2 + Scylla results
Cold cache: 12 minutes
Hot cache: 2 minutes
Reference results:
idle cluster: 2 minutes, 15 seconds
normal cluster: 4 minutes
overloaded cluster: 15 minutes
OK for Scala, what about Python?
No joinWithCassandraTable when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row lookup the ID matching table from Scylla
3. Count the resulting number of matches
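The three steps can be prototyped without any cluster. The sketch below uses a standard-library thread pool in place of Dask, and a dictionary in place of the Scylla table, so it only shows the shape of the computation; `count_matches` and the sample IDs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def count_matches(population: Iterable[str],
                  lookup: Callable[[str], Optional[str]],
                  concurrency: int = 8) -> int:
    """Look up every row concurrently and count the rows that found a match."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sum(1 for match in pool.map(lookup, population) if match is not None)


# Stand-in for the ID matching table; a real lookup would query Scylla.
FAKE_ID_MATCHING = {"partner-1": "cookie-123", "partner-3": "cookie-896"}

# Stand-in for step 1 (the rows loaded from Hive).
population = ["partner-1", "partner-2", "partner-3"]

print(count_matches(population, FAKE_ID_MATCHING.get))
```

Because each lookup is independent, the work parallelizes naturally, which is what makes a plain Python approach a candidate here.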
Dask + Hive + Scylla
[Diagram: Hive (population) ➔ Scylla (ID matching) ➔ Partitions count]
Results:
▪ Cold cache: 6 minutes
▪ Hot cache: 2 minutes
Dask + Hive + Scylla time break down
[Diagram: Hive ➔ Scylla (ID matching) ➔ Partitions count, annotated with timings of 50 seconds, 10 seconds and 60 seconds]
Dask + Parquet + Scylla
[Diagram: Parquet files (HDFS) ➔ Scylla ➔ Partitions count, annotated with "10 seconds!"]
Dask + Scylla results
Cold cache: 5 minutes
Hot cache: 1 minute 5 seconds
Spark 2 results:
cold cache: 6 minutes
hot cache: 2 minutes
Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files
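The driver-related tips could be combined roughly as follows, assuming the DataStax Python driver (`cassandra-driver`) built with libev support. This sketch uses `execute_concurrent_with_args`, a sibling of `execute_concurrent()` for one prepared statement with many parameter sets. The keyspace, table and column names and the concurrency value of 512 are assumptions; reading the Parquet files is left out.

```python
from typing import Iterable, List, Sequence, Tuple


def to_params(partner_ids: Iterable[str]) -> List[Tuple[str]]:
    """Turn partner IDs into the one-element tuples used as bind parameters."""
    return [(partner_id,) for partner_id in partner_ids]


def count_matches(contact_points: Sequence[str],
                  partner_ids: Iterable[str],
                  concurrency: int = 512) -> int:
    """Count the partner IDs that have at least one row in the matching table."""
    # Imported here so the helper above is usable without the driver installed.
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args
    from cassandra.io.libevreactor import LibevConnection

    # libev event loop instead of the asyncore-based default.
    cluster = Cluster(list(contact_points), connection_class=LibevConnection)
    session = cluster.connect()
    try:
        # Hypothetical keyspace, table and column names.
        lookup = session.prepare(
            "SELECT cookie_id FROM idmatching.cookie_by_partner "
            "WHERE partner_id = ? LIMIT 1"
        )
        results = execute_concurrent_with_args(
            session, lookup, to_params(partner_ids),
            concurrency=concurrency,  # raised from the driver default of 100
            raise_on_first_error=True,
        )
        return sum(1 for success, rows in results
                   if success and rows.one() is not None)
    finally:
        cluster.shutdown()
```

A call such as `count_matches(["10.0.0.1"], partner_ids)` would need a reachable cluster; the address is a placeholder.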
Scylla!
Production environment
+ 6x DELL R640
+ dual socket 2.6 GHz 14C, 512 GB RAM, Samsung 17xxx NVMe 3.2 TB
Gentoo Linux
Multi-DC setup
Ansible based provisioning and backups
Monitored by scylla-grafana-monitoring
Housekeeping handled by scylla-manager
Q&A
Stay in touch
alexys@numberly.com
@ultrabug
ultrabug.fr
United States
1900 Embarcadero Road
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank You!
