Snapchat 2018
Analytics at Snap
Big Data processing, slicing, and dicing
Charles Allen
charles.allen@snap.com
https://www.linkedin.com/in/charles-allen-255bab2a/
09.20.18
Overview
Who we are
Snap growth
Wrangling Data / Data tool chest
Druid’s powerhouse
Who we are
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Snap Inc. is a camera company
Express yourself!
Live in the moment
Snap growth
57 million DAU, Q2 2014
188 million DAU, Q2 2018
Source: 10-K; 10-Q; earnings call transcripts
User base up
Advertiser value up
Trillions of interactions per week.
Wrangling data
Getting insights into data
Natural pipeline development
Need: Lack of data causes pain
Source: Find data signal, and data processing SME
Develop: Work with development team for pipeline
Deploy: To production!
Maintain: Fire and forget, or keep it live?
Common data consumption formats
Scripting
High level of expertise
Extremely dynamic
Usually either a one-off for a specific human, or scripted for machine consumption.
Dashboards
Reports
Small qty of KPIs
Big tables or worksheets
“Executive” summarization
Multiple KPIs
Curated by expert
Some flexibility
Often operational in nature or usage
Data tool chest
Key architecture components for data flow control
Kafka: stream buffer
Pubsub: stream buffer
Airflow: batch processing orchestration
Storage: bundle storage
Key architecture components for business logic
Dataflow: stream and batch processing
Beam: pipeline business logic
Python: popular language
Java: popular language
Spark: stream and batch processing
Key architecture components for data consumption
BigQuery: bulk data warehousing
Druid: exploratory data storage
Superset: Druid-centric dashboarding
Looker: general dashboarding
Core event log workflows
GDPR
SOX
● Bundle lands in GCS
● Airflow churns data between BigQuery and GCS
● Over 20k DAG runs a week
● Lots of access control
Druid vs BigQuery
Druid
Multi cloud compatible.
Higher friction data load.
Lower friction data maintenance.
Gets more affordable with more usage.
You will track who has the most data.
Very fast.
Slice and dice.
BigQuery
Fully managed and hosted, GCP-only.
Low friction data load.
High friction data maintenance.
Price punishment for using too much.
You will track who is causing cost spikes.
Often slow, but faster than Hadoop.
Joins.
Internal use cases for Druid vs BigQuery
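As a rough illustration of the "slice and dice" workload on the Druid side of the comparison, the sketch below builds (but does not send) a Druid SQL request that counts events per value of one dimension over a time range. The broker address, the datasource name `app_events`, and the dimension `country` are hypothetical and do not come from the talk; the sketch also assumes Druid SQL is enabled on the broker.

```python
import json
import urllib.request

# Hypothetical broker address; not taken from the talk.
BROKER_SQL_URL = "http://druid-broker.example.internal:8082/druid/v2/sql"


def build_slice_query(datasource, dimension, start, end, limit=10):
    """Build a Druid SQL request counting events per value of one dimension."""
    sql = (
        f'SELECT "{dimension}", COUNT(*) AS events '
        f'FROM "{datasource}" '
        f"WHERE __time >= TIMESTAMP '{start}' AND __time < TIMESTAMP '{end}' "
        f'GROUP BY "{dimension}" '
        f"ORDER BY events DESC LIMIT {limit}"
    )
    body = json.dumps({"query": sql}).encode("utf-8")
    # The request is only constructed here; sending it needs a live broker.
    return urllib.request.Request(
        BROKER_SQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource and dimension names, for illustration only.
request = build_slice_query(
    "app_events", "country", "2018-09-13 00:00:00", "2018-09-20 00:00:00"
)
print(request.full_url)
print(json.loads(request.data)["query"])
```

Changing the `dimension` argument re-slices the same time range by a different column, which is the kind of interactive drill-down the slide contrasts with BigQuery's strength in joins.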
Druid’s powerhouse
Key Druid stats
Cores: >10k (large compute capacity)
Events per day: >100B (flowing into Druid)
Queries per day: >100k (answered)
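To put the daily figures in perspective, the short calculation below converts them into average per-second rates. Because the slide gives lower bounds (">100B", ">100k"), the results are floors on the daily average; they say nothing about peak load, which the talk does not quantify.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day):
    """Convert a daily total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


# Lower bounds quoted on the slide.
events_per_second = per_second(100e9)     # >100B events per day
queries_per_second = per_second(100_000)  # >100k queries per day

print(f"events/s  >= {events_per_second:,.0f}")
print(f"queries/s >= {queries_per_second:.2f}")
```

On these numbers the cluster ingests on the order of a million events per second on average while answering roughly one query per second, so ingestion volume, not query count, dominates the scale.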
Druid ingestion and consumption
Reports / Dashboards
SME Dashboards
Drill Down
Data Storage & Querying Platform
Platform GKE Cluster
ZooKeeper: coordination & configuration
Druid: indexed datastore (Java, Druid)
Druid: indexed datastore (Java, Druid)
Druid Broker
Druid Historicals*
Druid Coordinator
Java, CoreOS, Druid, GCE
Mesos: cluster management (GCE)
Marathon: orchestration (GCE)
GCS: deep storage
CloudSQL: Druid metadata
ZooKeeper: coordination & configuration
ZooKeeper: coordination & configuration
MongoDB: query time lookup cache
● GCP Deployment Manager
● Helm
Druid retention tunings
Recent data FAST: NVME-SSD, 1 Week, 2 Hot
Recent data HA: 1 Week, 1 Cold
Keep older data available: Older Data, HA
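One way to read the retention slide is as a pair of Druid Coordinator load rules: the most recent week gets two replicas on a hot tier plus one on a cold tier, and everything older stays on the cold tier. The sketch below builds such a rule list. The tier names `hot` and `cold` and the replica count for older data are assumptions; the slide only says "HA" for older data, so two cold replicas is a guess, not Snap's actual setting.

```python
import json

# Assumed tier names; the slide only says "Hot" and "Cold".
HOT_TIER = "hot"
COLD_TIER = "cold"


def retention_rules(recent_period="P1W", hot_replicas=2,
                    cold_replicas_recent=1, cold_replicas_old=2):
    """Build Druid load rules; the Coordinator applies the first rule that matches."""
    return [
        {
            # Recent data: fast (hot tier) and also replicated on the cold tier.
            "type": "loadByPeriod",
            "period": recent_period,
            "tieredReplicants": {
                HOT_TIER: hot_replicas,
                COLD_TIER: cold_replicas_recent,
            },
        },
        {
            # Older data: kept available on the cold tier only.
            # cold_replicas_old=2 is an assumption standing in for "HA".
            "type": "loadForever",
            "tieredReplicants": {COLD_TIER: cold_replicas_old},
        },
    ]


rules = retention_rules()
print(json.dumps(rules, indent=2))
```

Rule order matters: a segment younger than a week matches the `loadByPeriod` rule first, and only falls through to `loadForever` once it ages out of that window.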
We Are Hiring!
charles.allen@snap.com
https://www.snap.com/jobs/
