Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platform (CDP) with ScyllaDB


Presenters
Shubham Patil, Lead Software Engineer
■ Leads the platform engineering team at Zeotap for CDP product suite
■ Responsible for its architecture, design and engineering delivery
■ 6 years of experience building scalable distributed systems
Safal Pandita, Senior Software Engineer
■ Leads the Scylla integrations at Zeotap for CDP product suite
■ 4 years of experience in building scalable distributed systems

About Zeotap
Zeotap is a privacy-focused 360º Customer Data Platform
(CDP) made for privacy-sensitive marketers
■ Enables brands to better understand their customers
- 360º view
■ Built on GCP
■ Native 3P data enrichment from over 130 premium
sources
PRIVACY AND SECURITY IS IN OUR DNA
2018-2021: Customer Data Platform
2014-2021: Stitching data from 120 companies for 500M customers under strict EU privacy law, for better targeting for brands
https://www.youtube.com/watch?v=XS790sG1Y7I

What is a Customer Data Platform (CDP)?
CONSENTED AND ACTIONABLE TRUSTED GOLDEN RECORDS OF 1P CUSTOMER PROFILES TO SUPPORT MARKETING GOALS
■ Data Unification: Build your single customer view
■ Consent Unification: Unify consent across user IDs and channels
[Diagram labels: Client ID, MAID, Email, Phone, Web Cookies, Other IDs, Marketing Preferences, Consent Purposes]
A GOLDEN RECORD: Your own private identity graph
[Diagram labels: Universal ID, Contract History, Demographics, Loyalty Status]
CDP: Unification of all silos

Zeotap’s CDP Tech Requirements
Batch (Data Onboarding)
■ Ingestion of e.g. CRM/database dumps
■ Batch activation of user audiences in DMPs
■ Bulk data exports to client databases/sinks
Realtime (Event Orchestration)
■ Ingestion of user data from website interactions in real time
■ Real-time activation of user audiences
Privacy/Compliance (Consent Mastering)
■ User opt-out, consent management, mastering, etc.

CDP Tech Matrix
Requirements v1 v2 v3
Multi-regional, Multi-Tenant, Privacy and GDPR compliant deployment
Sub-second/Realtime writes (with BQ streaming inserts)
Sub-second/Realtime reads/deletes (for ‘On The Fly’ User Unification)
Point Lookups
Works for data at every scale (from a few MB to PB)
Mature and transparent monitoring stack
Supports Spark integration to export data dumps to data lakes
Complete control on sizing of cluster/processing
Supports Encryption: At rest, value level, rotation (RawPII)
Complete control on underlying data model and scans
Simple SQL-like query capabilities
Enterprise Support

CDP Tech Matrix Review
Requirements v1 v2 v3
Multi-regional, Multi-Tenant, Privacy and GDPR compliant deployment ✅ ✅ ✅
Sub-second/Realtime writes (with BQ streaming inserts) ✅ ✅ ✅
Sub-second/Realtime reads/deletes (for ‘On The Fly’ User Unification) (600 ms in JG vs. 30 ms in Scylla) ❌ ✅ ✅
Point Lookups ❌ ❌ ✅
Works for data at every scale (from a few MB to PB) ✅ ❌ ✅
Mature and transparent monitoring stack ❌ ❌ ✅
Supports Spark integration to export data dumps to data lakes ✅ ✅ ✅
Complete control on sizing of cluster/processing ❌ ✅ ✅
Supports Encryption: At rest, value level, rotation (RawPII) ✅ ✅ ✅
Complete control on underlying data model and scans ✅ ❌ ✅
Simple SQL-like query capabilities ✅ ❌ ✅
Enterprise Support ✅ ❌ ✅

Requirements from a User Store
■ On-The-Fly User Unification (ID Resolution)
■ Fast lookup store with low latencies for both read and write
■ Flexible enough to be used as a Profile/Consent/ID store
■ Needed to be used as a linkage store
■ We needed TTL in a few different ways
• Profiles/Consents (Attributes in a Map)
• ID (Elements of a collection)
• ID Store (row level)

Read/Write Query Patterns
Pattern 1: Read IdStore by Id Type and Value
Pattern 2: Find User Profiles by UCID
Pattern 3: Stamp UCID in Id Store
Pattern 4: Insert profiles/consents/preferences in User Store
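
The four query patterns can be read as one resolution flow. The sketch below is a minimal in-memory illustration, not Zeotap's implementation: the dictionaries `id_store` and `user_store`, the function `resolve_and_upsert`, and the rule "reuse the first UCID found, otherwise mint a new one" are all assumptions made for the example. Merging several distinct UCIDs into one is deliberately left out.

```python
import uuid

# Hypothetical in-memory stand-ins for the two logical stores.
id_store = {}    # (id_type, id_value) -> ucid
user_store = {}  # ucid -> {"profile": {...}, "consents": {...}}


def resolve_and_upsert(ids, profile, consents):
    """Resolve (id_type, id_value) pairs to one UCID and store the data."""
    # Pattern 1: read the ID store by ID type and value.
    known = [id_store[key] for key in ids if key in id_store]
    # Assumption: reuse the first UCID found, otherwise mint a new one.
    ucid = known[0] if known else str(uuid.uuid4())
    # Pattern 2: find the existing user record by UCID (create if absent).
    record = user_store.setdefault(ucid, {"profile": {}, "consents": {}})
    # Pattern 3: stamp the UCID on every incoming ID.
    for key in ids:
        id_store[key] = ucid
    # Pattern 4: insert profile attributes and consents into the user store.
    record["profile"].update(profile)
    record["consents"].update(consents)
    return ucid
```

Calling the function twice with an overlapping ID (for example the same email, then the same email plus a new cookie) returns the same UCID both times and links the new cookie to it.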

Data Model v1.0
■ Isolation for each client achieved through keyspaces
■ Separate clusters for each region (EU, US, IN, UK)
■ Each keyspace could have a different schema
■ RF = 3, ICS (Incremental Compaction Strategy), CL = QUORUM for reads and writes
■ Single table that acts as our profile, consent and linkage store
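
The RF = 3 and CL=QUORUM choice can be checked with a little arithmetic. The helper names below are made up for illustration; the rule itself (a quorum is a majority of replicas, and a read is guaranteed to overlap the latest acknowledged write when read replicas plus write replicas exceed RF) is the standard one for Cassandra-style consistency levels.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at CL=QUORUM (a majority)."""
    return replication_factor // 2 + 1


def reads_overlap_writes(rf: int, read_replicas: int, write_replicas: int) -> bool:
    """True when every read set must intersect every write set."""
    return read_replicas + write_replicas > rf


rf = 3
q = quorum(rf)                                 # replicas needed per request
tolerated_failures = rf - q                    # replicas that may be down
consistent = reads_overlap_writes(rf, q, q)    # QUORUM reads + QUORUM writes
```

With RF = 3 this gives a quorum of 2, so reads and writes keep working with one replica down while still overlapping on at least one replica.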

Problems faced
■ Batch sizes became a bottleneck since our
transactions needed to be atomic across partitions.
We crossed the recommended limit of ~100K per
batch.
■ Collection sizes started increasing beyond the
recommended size of ~1MB
■ Latencies worsened due to large batches and
collection sizes. In some cases, queries started timing
out

Applied Solutions
■ Split queries into multiple batches, with multiple
retries for each batch.
■ Use prepared statements to improve
performance
■ Use TTL to keep total volume in check
Bottlenecks - Hot Rows
■ Storing linkages in a collection became a
bottleneck due to our increasing scale
■ Going beyond the recommended ~1MB per collection
degrades latency and puts SLAs at risk
■ Collections go through a
serialization/deserialization step in Scylla, which
makes them slower compared to other data types

Updated Data Model
■ Queries that were timing out earlier (>10s) due to high linkage counts started succeeding
within our SLAs (~30ms)
■ Separate linkage store: TTLs are easier to maintain on rows, which was
complicated on individual elements of a collection
■ The ~1MB collection limit no longer caps the number of linkages, which allowed us to scale more
effectively
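
To illustrate why one row per linkage makes TTL handling simpler, here is a small in-memory sketch. The class `LinkageStore`, its key layout (UCID as the partition, ID type and value as the row key) and the injectable clock are assumptions for the example; the talk does not show the actual schema.

```python
import time
from collections import defaultdict


class LinkageStore:
    """One row per linkage, each row carrying its own expiry (row-level TTL)."""

    def __init__(self, clock=time.time):
        # Assumed layout: ucid -> {(id_type, id_value): expiry timestamp}
        self._partitions = defaultdict(dict)
        self._clock = clock

    def link(self, ucid, id_type, id_value, ttl_seconds):
        """Insert or refresh a single linkage row with its own TTL."""
        self._partitions[ucid][(id_type, id_value)] = self._clock() + ttl_seconds

    def linkages(self, ucid):
        """Return the live (id_type, id_value) rows for a UCID, sorted."""
        now = self._clock()
        rows = self._partitions.get(ucid, {})
        return sorted(key for key, expiry in rows.items() if expiry > now)
```

Each linkage expires independently, and adding a linkage touches only its own row instead of rewriting a growing collection.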

Production Gotchas - PK Migration
■ Problem: There is no easy way to migrate your primary key once the data is live in the tables.
■ Solution: Use Scylla Migrator to move the data to an intermediate/temporary table with the
required schema.
• Since we wanted to reuse the names of our original tables (you can’t rename a
table), we had to copy the SSTables from our migrated schema.
• Lesson: Choose your PK wisely

Production Gotchas - Schema Corruption
■ Problem: Schemas can get corrupted while copying SSTables. Schema settlement under load can
sometimes take more than a minute and can cause the cluster to crash.
■ Solutions
• Always check that your schema is correctly replicated on all nodes before attempting
SSTable copying.
• Ask the Scylla team, SSH in manually, or write your own service around cqlsh
• The Scylla team resolved our issue by restoring our snapshots and redoing the migration for the
affected schemas.
• Lesson: ALWAYS BACK UP YOUR DATA
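
The "check that the schema is replicated on all nodes" step could be automated along these lines. This is a sketch under assumptions: `fetch_versions` is a hypothetical callable returning a node-to-schema-version mapping, which in practice might be filled from the `schema_version` column of `system.local` and `system.peers`, or from `nodetool describecluster`. The timeout and polling interval are placeholder values.

```python
import time


def schema_in_agreement(versions_by_node):
    """True when at least one node reported and all versions are identical."""
    return len(set(versions_by_node.values())) == 1


def wait_for_schema_agreement(fetch_versions, timeout_seconds=120.0,
                              poll_seconds=1.0, sleep=time.sleep,
                              clock=time.monotonic):
    """Poll until all nodes agree on the schema or the timeout expires.

    `fetch_versions` is a hypothetical callable returning {node: version}.
    """
    deadline = clock() + timeout_seconds
    while True:
        if schema_in_agreement(fetch_versions()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

A migration script would call `wait_for_schema_agreement` before copying SSTables and abort if it returns `False`.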

Production Setup
■ 4 clusters (EU, UK, IN, US) - 6 n2-highmem-64 nodes - Scylla v2021.1.5
■ 130+ clients/keyspaces being managed
■ Max 60K QPS
■ 30 ms avg. read
■ 10 ms avg. write
■ 5.4 TB of data ingested
■ 50 GB max keyspace size

Future Plans
■ Microservices for handling schema corruption and updates
■ Explore Lightweight Transactions (LWTs) for consistency guarantees
■ Explore encrypted data rotation without blocking real-time writes

Thank you!
Stay in touch
Shubham Patil & Safal Pandita
/itsshubhpatil, /safalpandita
patil.sm17@gmail.com
safalpandita@gmail.com
