Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla

Alexys Jacob - CTO, Numberly

Moderator - Peter Corless, ScyllaDB
Peter has a 29-year career in Silicon Valley that
threads through stints at e2f, Aerospike, Cisco and
Apple. He is passionate about technology, customer
success, engendering community, and social media. In
his off hours he enjoys playing 4X strategy games.
Twitter: @petercorless

About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
+ Learn more at scylladb.com

Presenter - Alexys Jacob, Numberly

whoami
@ultrabug
1 Eiffel Tower
2 Soccer World Cups
15 years in the data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly, living in Paris, France

Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)

Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and destinations.
➔ For this we use ID matching tables.

ID matching tables
1. SELECT the reference population
2. JOIN it with the ID matching table
3. The MATCHED population is usable by the partner
Queried AND updated all the time!
➔ High read AND write workload

Real life example: retargeting
From a database (email) to a web banner (cookie)
SELECT: previous donors
▪ generous@coconut.fr
▪ isupportu@lab.com
▪ wiki4ever@wp.eu
▪ openinternet@free.fr
MATCH: ID matching table
▪ Cookie id = 123
▪ Cookie id = 297
▪ ?
▪ Cookie id = 896
ACTIVATE: ad exchanges (AppNexus, ..., Google) show the banner for https://kitty.eu
▪ Ad Exchange user cookie id 123
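
The SELECT / MATCH / ACTIVATE flow above can be sketched in a few lines of Python. This is only an illustration: plain dicts stand in for the real data stores, and the pairing of each email with a cookie id is an assumption based on the order in which they appear on the slide.

```python
# Hypothetical in-memory stand-ins for the CRM extract and the ID matching table.
previous_donors = [
    "generous@coconut.fr",
    "isupportu@lab.com",
    "wiki4ever@wp.eu",
    "openinternet@free.fr",
]

# email -> partner cookie id (assumed pairing; wiki4ever@wp.eu has no known cookie)
id_matching_table = {
    "generous@coconut.fr": 123,
    "isupportu@lab.com": 297,
    "openinternet@free.fr": 896,
}


def match_population(population, matching_table):
    """JOIN the selected population with the ID matching table.

    Returns (matched, unmatched): a dict of source id -> partner id,
    and the list of source ids that have no known partner id.
    """
    matched, unmatched = {}, []
    for source_id in population:
        if source_id in matching_table:
            matched[source_id] = matching_table[source_id]
        else:
            unmatched.append(source_id)
    return matched, unmatched


matched, unmatched = match_population(previous_donors, id_matching_table)
# Only the matched cookie ids can be handed to the ad exchange for activation.
cookies_to_activate = sorted(matched.values())
```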

Current implementation(s)
Events ➔ Message queues
▪ Batch pipeline: HDFS ➔ Batch Calculation ➔ Hive
▪ Real time pipeline: Real time Programs ➔ MongoDB

Drawbacks & pitfalls
[Diagram: the current implementation, with Events ➔ Message queues feeding the batch pipeline (HDFS, Batch Calculation, Hive) and the real time pipeline (Real time Programs, MongoDB)]

Future implementation using Scylla?
Events ➔ Message queues
▪ Batch pipeline: Batch Calculation ➔ Scylla
▪ Real time pipeline: Real time Programs ➔ Scylla

Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ Compete with our production? Scylla is in!

Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with the given partner ID over the last N months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions must be answered too!
➔ Denormalization
➔ Prototype with your language of choice!
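
A prototype of this query-based, denormalized model could look like the following Python sketch. The class and method names are hypothetical, and in-memory dicts stand in for the two tables (one per lookup direction). The "last N months" window is expressed as a `since` date to keep the example short.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class IdMatchingPrototype:
    """In-memory prototype of a denormalized ID matching model.

    Every association is written twice, once per lookup direction,
    which mirrors having one table per query.
    """

    def __init__(self):
        self.by_partner = defaultdict(list)  # partner_id -> [(date, cookie_id)]
        self.by_cookie = defaultdict(list)   # cookie_id -> [(date, partner_id)]

    def add(self, partner_id, cookie_id, date):
        self.by_partner[partner_id].append((date, cookie_id))
        self.by_cookie[cookie_id].append((date, partner_id))

    @staticmethod
    def _since(rows, since):
        return sorted({other for date, other in rows if date >= since})

    def cookies_for_partner(self, partner_id, since):
        """Question 1: all cookie IDs seen for a partner ID since a date."""
        return self._since(self.by_partner.get(partner_id, []), since)

    def last_cookie_for_partner(self, partner_id):
        """Question 2: the most recent (date, cookie ID) for a partner ID."""
        rows = self.by_partner.get(partner_id, [])
        return max(rows) if rows else None

    def partners_for_cookie(self, cookie_id, since):
        """Reverse of question 1."""
        return self._since(self.by_cookie.get(cookie_id, []), since)

    def last_partner_for_cookie(self, cookie_id):
        """Reverse of question 2."""
        rows = self.by_cookie.get(cookie_id, [])
        return max(rows) if rows else None


now = datetime(2018, 11, 1)
model = IdMatchingPrototype()
model.add("partner-1", "cookie-a", now - timedelta(days=200))
model.add("partner-1", "cookie-b", now - timedelta(days=10))
recent = model.cookies_for_partner("partner-1", since=now - timedelta(days=90))
last = model.last_cookie_for_partner("partner-1")
```

Writing the tests for the four questions first, then adjusting the structures until they pass, is one way to apply the test-driven approach before committing to a real schema.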

Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the sstable
➔ Reduced read latency!
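
As an illustration of the clustering order tip, here is a minimal sketch using CQL statements held in Python strings. The table and column names are hypothetical; only the `CLUSTERING ORDER BY` clause comes from the tip itself. The `session` argument is assumed to be a Session from the Python cassandra-driver, already bound to a keyspace.

```python
# Hypothetical table and column names; the CLUSTERING ORDER clause is the point.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS partner_to_cookie (
    partner_id text,
    date timestamp,
    cookie_id text,
    PRIMARY KEY ((partner_id), date, cookie_id)
) WITH CLUSTERING ORDER BY (date DESC, cookie_id ASC)
"""

# With date DESC the newest row of a partition is stored first,
# so the latest cookie ID is simply the first row returned.
LAST_COOKIE = (
    "SELECT cookie_id, date FROM partner_to_cookie "
    "WHERE partner_id = %s LIMIT 1"
)


def create_schema(session):
    """Create the table with the date clustering column in descending order."""
    session.execute(CREATE_TABLE)


def last_cookie(session, partner_id):
    """Return the most recent row for partner_id, or None if there is none."""
    rows = session.execute(LAST_COOKIE, (partner_id,))
    return next(iter(rows), None)
```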

scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)

Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!

Spark 2 + Hive: reference metrics
[Diagram: Hive (population) + Hive (ID matching), Partitions, count]
Results:
▪ idle cluster: 2 minutes, 15 seconds
▪ normal cluster: 4 minutes
▪ overloaded cluster: 15 minutes

Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!

Spark 2 + Hive + Scylla
[Diagram: Hive (population) + Scylla (ID matching), Partitions, count]

Spark 2 / Scala test workload
DataStax’s spark-cassandra-connector joinWithCassandraTable
▪ spark-cassandra-connector-2.0.1-s_2.11.jar
▪ Java 7

Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per second
▪ spark.cassandra.input.reads_per_sec=6666

Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
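
The settings from both tuning slides can be gathered in one place and rendered as `spark-submit` arguments. The sketch below is one possible way to do it in Python. The helper name is hypothetical, and the values are the ones listed above, to be adapted to your own cluster rather than copied as-is.

```python
# Values taken from the two tuning slides; treat them as a starting point.
SPARK_SCYLLA_CONF = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "30",
    "spark.cassandra.input.split.size_in_mb": "1",
    "spark.cassandra.input.reads_per_sec": "6666",
    "spark.cassandra.connection.connections_per_executor_max": "100",
    "spark.cassandra.connection.timeout_ms": "150000",
    "spark.cassandra.read.timeout_ms": "150000",
}


def spark_submit_args(conf):
    """Render a conf dict as repeated `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


args = spark_submit_args(SPARK_SCYLLA_CONF)
```

Keeping the values in a single dict makes it easier to check the two timeouts against the server-side values in scylla.yaml.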
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/

Spark 2 + Scylla results
Cold cache: 12 minutes
Hot cache: 2 minutes
Reference results:
idle cluster: 2 minutes, 15 seconds
normal cluster: 4 minutes
overloaded cluster: 15 minutes

OK for Scala, what about Python?
No joinWithCassandraTable when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row, look up the ID matching table in Scylla
3. Count the resulting number of matches
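
The three steps above boil down to a per-row lookup followed by a count. A minimal Python sketch, assuming the population is already loaded as a list of IDs, is shown below. The function names, table name and column name are hypothetical, and `session` is assumed to be a cassandra-driver Session.

```python
def count_matches(population_ids, lookup):
    """Count how many IDs of the population have at least one match.

    `lookup` is any callable returning an iterable of matching rows
    (empty when there is no match).
    """
    matches = 0
    for source_id in population_ids:
        rows = lookup(source_id)
        if next(iter(rows), None) is not None:
            matches += 1
    return matches


def make_scylla_lookup(session, table="id_matching", key="partner_id"):
    """Build a lookup callable on top of a cassandra-driver Session.

    The table and column names are placeholders for illustration.
    """
    query = f"SELECT * FROM {table} WHERE {key} = %s"
    return lambda source_id: session.execute(query, (source_id,))
```

Passing the lookup as a callable keeps the counting logic testable with a plain dict before pointing it at a real cluster.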

Dask + Hive + Scylla
[Diagram: Hive (population), Scylla (ID matching), Partitions, count]
Results:
▪ Cold cache: 6 min
▪ Hot cache: 2 min
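
The "Partitions" and "count" stages can be sketched with `dask.delayed`. The talk does not say which Dask API was used, so this is only one plausible shape: split the population into partitions, count the matches of each partition in parallel, then sum the partial counts. The function names and the partition size are assumptions.

```python
def partition(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_partition(ids, lookup):
    """Count the IDs of one partition that have at least one match."""
    return sum(1 for i in ids if next(iter(lookup(i)), None) is not None)


def dask_count(ids, lookup, partition_size=100_000):
    """Count matches over all partitions in parallel with dask.delayed."""
    from dask import delayed  # imported lazily: requires the dask package

    tasks = [
        delayed(count_partition)(chunk, lookup)
        for chunk in partition(ids, partition_size)
    ]
    # Summing the per-partition counts is itself a delayed task.
    return delayed(sum)(tasks).compute()
```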

Dask + Hive + Scylla time break down
[Diagram: Hive, Scylla (ID matching), Partitions, count, annotated with 50 seconds, 10 seconds and 60 seconds (2 minutes in total)]

Dask + Parquet + Scylla
[Diagram: Parquet files (HDFS), Scylla, Partitions, count]
10 seconds!

Dask + Scylla results
Cold cache: 5 minutes
Hot cache: 1 minute 5 seconds
Spark 2 results:
cold cache: 6 minutes
hot cache: 2 minutes

Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files
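
The tips above can be combined as in the following sketch. It is not the code from the talk. It uses `execute_concurrent_with_args`, the convenience wrapper around `execute_concurrent()` in the Python cassandra-driver. The table name, column names, HDFS port and concurrency value are assumptions for illustration. Third-party imports are done inside the functions so the helpers can be read independently.

```python
def make_session(contact_points, keyspace):
    """Open a session using the libev event loop instead of asyncore."""
    from cassandra.cluster import Cluster
    from cassandra.io.libevreactor import LibevConnection

    cluster = Cluster(contact_points, connection_class=LibevConnection)
    return cluster.connect(keyspace)


def read_parquet_column(path, column, host, port=8020):
    """Read one column of a Parquet file stored on HDFS into a Python list."""
    import pyarrow.parquet as pq
    from hdfs3 import HDFileSystem

    hdfs = HDFileSystem(host=host, port=port)
    with hdfs.open(path, "rb") as f:
        table = pq.read_table(f, columns=[column])
    return table.to_pydict()[column]


def lookup_all(session, ids, concurrency=512):
    """Run one lookup per ID with `concurrency` requests in flight."""
    from cassandra.concurrent import execute_concurrent_with_args

    # Hypothetical table and column names.
    statement = session.prepare(
        "SELECT cookie_id FROM partner_to_cookie WHERE partner_id = ?"
    )
    return execute_concurrent_with_args(
        session,
        statement,
        [(i,) for i in ids],
        concurrency=concurrency,
        raise_on_first_error=False,
    )


def count_successful_matches(results):
    """Count successful lookups that returned at least one row.

    `results` is a sequence of (success, rows_or_exception) pairs,
    the shape returned by execute_concurrent.
    """
    return sum(
        1
        for success, rows in results
        if success and next(iter(rows), None) is not None
    )
```

With `raise_on_first_error=False`, failed lookups come back as `(False, exception)` pairs instead of aborting the whole batch, which is why the count checks the success flag first.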

Production environment
+ 6x DELL R640
+ dual socket 2.6 GHz 14C, 512 GB RAM, Samsung 17xxx NVMe 3.2 TB
Gentoo Linux
Multi-DC setup
Ansible based provisioning and backups
Monitored by scylla-grafana-monitoring
Housekeeping handled by scylla-manager

Q&A
Stay in touch
alexys@numberly.com
@ultrabug
ultrabug.fr

United States
1900 Embarcadero Road
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank You!
