Data Analytics and Processing at Snap - Druid Meetup LA - September 2018

Snapchat 2018
Analytics at Snap
Big Data processing, slicing, and dicing
Charles Allen
charles.allen@snap.com
https://www.linkedin.com/in/charles-allen-255bab2a/

09.20.18
Overview
Who we are
Snap growth
Wrangling Data / Data tool chest
Druid’s powerhouse

Express yourself!

Live in the moment

User base up
Advertiser value up
57 million DAU, Q2 2014
188 million DAU, Q2 2018
Source: 10-K; 10-Q; earnings call transcripts

Trillions of interactions per week.

Getting insights into data
Natural pipeline development
Need: Lack of data causes pain
Source: Find data signal, and data processing SME
Develop: Work with development team for pipeline
Deploy: To production!
Maintain: Fire and forget, or keep it live?

Common data consumption formats
Scripting
● High level of expertise
● Extremely dynamic
● Usually either a one-off for a specific human or scripted for machine consumption
Reports
● Small qty of KPIs
● Big tables or worksheets
● “Executive” summarization
Dashboards
● Multiple KPIs
● Curated by expert
● Some flexibility
● Often operational in nature or usage

Key architecture components for data flow control
Stream buffer: Kafka
Stream buffer: Pubsub
Batch processing orchestration: Airflow
Bundle storage: Storage

Key architecture components for business logic
Stream and batch processing: Dataflow
Pipeline business logic: Beam
Popular language: Python
Popular language: Java
Stream and batch processing: Spark

Key architecture components for data consumption
Bulk data warehousing: BigQuery
Exploratory data storage: Druid
Druid-centric dashboarding: Superset
General dashboarding: Looker

Core event log workflows
GDPR
SOX
● Bundle lands in GCS
● Airflow churns data between BigQuery and GCS
● Over 20k DAG runs a week
● Lots of access control
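
As an illustration of the workflow above, here is a minimal sketch of how an orchestrator might order the steps of one event log workflow. It uses only the Python standard library (Python 3.9+), not the Airflow API, and every task name is hypothetical. The last lines turn the "over 20k DAG runs a week" figure into an average rate.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for one workflow, mapped to the tasks they depend on.
dependencies = {
    "wait_for_bundle_in_gcs": set(),
    "load_bundle_into_bigquery": {"wait_for_bundle_in_gcs"},
    "transform_in_bigquery": {"load_bundle_into_bigquery"},
    "export_results_to_gcs": {"transform_in_bigquery"},
}

# A DAG scheduler runs each task only after all of its upstream tasks finish.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# 20k DAG runs a week, expressed as an average number of runs per minute.
MINUTES_PER_WEEK = 7 * 24 * 60
runs_per_minute = 20_000 / MINUTES_PER_WEEK
print(f"about {runs_per_minute:.1f} DAG runs per minute on average")
```

In a real deployment each entry would be an Airflow operator and the dependencies would be declared on the DAG itself. The sketch only shows the ordering idea.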

Druid vs BigQuery
Druid
Multi cloud compatible.
Higher friction data load.
Lower friction data maintenance.
Gets more affordable with more usage.
You will track who has the most data.
Very fast.
Slice and dice.
BigQuery
Fully managed and hosted, GCP-only.
Low friction data load.
High friction data maintenance.
Price punishment for using too much.
You will track who is causing cost spikes.
Often slow, but faster than Hadoop.
Joins.
Internal use cases for Druid vs BigQuery
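
To make "slice and dice" concrete, here is a minimal sketch that builds a Druid native topN query as a Python dictionary. The datasource, dimension, and metric names are hypothetical and are not taken from the talk. The query is only constructed and printed, never sent anywhere.

```python
import json


def slice_and_dice_query(datasource, interval, dimension, metric,
                         filter_dim=None, filter_value=None, limit=10):
    """Build a Druid native topN query: top `limit` values of `dimension` by summed `metric`."""
    query = {
        "queryType": "topN",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "dimension": dimension,   # the dimension to "dice" by
        "metric": "total",        # rank results by the aggregation below
        "threshold": limit,
        "aggregations": [
            {"type": "longSum", "name": "total", "fieldName": metric},
        ],
    }
    if filter_dim is not None:
        # The "slice": restrict the data to a single value of another dimension.
        query["filter"] = {
            "type": "selector",
            "dimension": filter_dim,
            "value": filter_value,
        }
    return query


# Hypothetical usage: top countries by event count, for one platform only.
query = slice_and_dice_query(
    datasource="app_events",
    interval="2018-09-13/2018-09-20",
    dimension="country",
    metric="count",
    filter_dim="platform",
    filter_value="ios",
)
print(json.dumps(query, indent=2))
```

A query like this would normally be POSTed as JSON to a Druid Broker. Changing `dimension` or the filter is all it takes to look at the same data from another angle, which is the workflow the slide describes.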

Key Druid stats
Large compute capacity: >10k cores
Flowing into Druid: >100B events per day
Answered: >100k queries per day
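
For a sense of scale, the daily figures can be converted to average per-second rates. This is back-of-the-envelope arithmetic only. The slide gives lower bounds (">"), so the results are lower bounds too, and real traffic is unlikely to be spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Lower bounds quoted on the slide.
events_per_day = 100_000_000_000   # >100B events per day
queries_per_day = 100_000          # >100k queries per day

# Average rates, assuming load spread evenly across the day.
events_per_second = events_per_day / SECONDS_PER_DAY
queries_per_second = queries_per_day / SECONDS_PER_DAY

print(f"at least {events_per_second:,.0f} events per second on average")
print(f"at least {queries_per_second:.2f} queries per second on average")
```

That works out to roughly 1.16 million ingested events per second and a little over one query per second, averaged over the day.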

Druid ingestion and consumption
Reports / Dashboards
SME Dashboards
Drill Down

Data Storage & Querying Platform
Platform GKE Cluster
ZooKeeper: Coordination & configuration
Druid: Indexed datastore (Java, Druid)
Druid Broker, Druid Historicals*, Druid Coordinator (Java, CoreOS, Druid, GCE)
Mesos: Cluster Management (GCE)
Marathon: Orchestration (GCE)
GCS: Deep Storage
CloudSQL: Druid Metadata
MongoDB: Query Time Lookup Cache
● GCP Deployment Manager
● Helm

Druid retention tunings
Recent data FAST: NVMe SSD, 1 Week, 2 Hot
Recent data HA: 1 Week, 1 Cold
Keep older data available: Older Data, HA
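
One way to express retention tunings like these is as Druid coordinator load rules. The sketch below is an interpretation, not the configuration used at Snap. It assumes historical tiers named "hot" and "cold", reads "2 Hot" and "1 Cold" as replica counts for the most recent week, and reads "HA" for older data as two replicas on the cold tier.

```python
import json


def build_retention_rules(recent_period="P1W", hot_replicas=2,
                          cold_replicas=1, older_cold_replicas=2):
    """Return a Druid load-rule list. Rules are evaluated in order; the first match wins."""
    return [
        # Most recent week: fast replicas on the hot tier plus one cold replica.
        {
            "type": "loadByPeriod",
            "period": recent_period,
            "tieredReplicants": {"hot": hot_replicas, "cold": cold_replicas},
        },
        # Everything older: cold tier only, replicated for availability (assumed count).
        {
            "type": "loadForever",
            "tieredReplicants": {"cold": older_cold_replicas},
        },
    ]


rules = build_retention_rules()
print(json.dumps(rules, indent=2))
```

Rules like these are set per datasource through the Druid Coordinator. The tier names must match the tier each Historical is configured with.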

We Are Hiring!
charles.allen@snap.com
https://www.snap.com/jobs/
