The Data Mullet 
From All SQL to No SQL back to Some SQL 
Alexis Lê-Quôc @alq
This Talk 
• A (mostly) DIRTy Architecture for... 
• A new application (datadoghq.com) on a limited budget 
• Running on a public cloud 
• Focussing on data stores.
Some context
[Diagram: what Ops and Dev teams deal with: servers, monitoring, IaaS/PaaS, usage analytics, performance management, apps, hosting, CDNs, asset management, SDLC]
Dev & Ops “collaborate” 
Concretely, what does Datadog do?
Watching real-time feeds
Looking for patterns
Constant telemetry 
Real-time 
Bursty batches 
Share
Data Taxonomy
• Metrics
  • Unique visitors
  • Load
  • Transaction duration
  • ...
• Events
  • Conversations
  • Alerts
  • Build & Deploys
  • ...
Unit of scale
• 1 source, typically a server
  • 100 metrics
  • Every 15 s
  • 24,000 points per hour
  • ~3 bytes per point
  • 100 KB/hour, 850 MB/year
• Events
  • whenever they occur
  • Highest resolution: 1s
  • Small payload + metadata
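The per-source arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch: the variable names are illustrative, and the 3 bytes per point is the slide's approximate figure, so the raw product lands somewhat below the rounded 100 KB/hour quoted above.

```python
# Back-of-the-envelope check of the "unit of scale" figures.
# All names are illustrative; the inputs come from the slide.
METRICS_PER_SOURCE = 100
REPORT_INTERVAL_S = 15
BYTES_PER_POINT = 3          # approximate, per the slide

points_per_hour = METRICS_PER_SOURCE * (3600 // REPORT_INTERVAL_S)
raw_bytes_per_hour = points_per_hour * BYTES_PER_POINT

# The slide rounds the hourly volume to ~100 KB; a year at that rate:
ROUNDED_KB_PER_HOUR = 100
mb_per_year = ROUNDED_KB_PER_HOUR * 24 * 365 / 1024

print(points_per_hour)        # 24000
print(raw_bytes_per_hour)     # 72000
print(round(mb_per_year))     # 855
```

The yearly figure only matches the slide when starting from the rounded 100 KB/hour, which suggests the ~3 bytes per point leaves out some per-point overhead.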
ACID, BASE & DIRT
• ACID
  • http://en.wikipedia.org/wiki/ACID
• BASE
  • http://en.wikipedia.org/wiki/Eventual_consistency
• DIRT (Bryan Cantrill at Surge 2010)
  • http://dtrace.org/resources/bmc/DIRT.pdf
Let’s dig some DIRT
DI-RealTime
The Consequences of DIRT? 
Latency 
• Data consumed by people (and machines) 
• Low end-to-end latency (5-15s) 
• Psycho-physiological Factor 
• Same order of magnitude as email/SMS* 
* http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.2465&rep=rep1&type=pdf
The Consequences of DIRT? 
Concurrency 
• Concurrent events & data points show up in sync 
• Access Patterns? 
• All recent data, e.g. last 24 hours
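The "all recent data" access pattern suggests a store that only has to keep a rolling window. A minimal single-process sketch, assuming points arrive in time order; `RecentWindow` and `retention_s` are hypothetical names, not something taken from the talk.

```python
from collections import deque


class RecentWindow:
    """Keep only the points from the last `retention_s` seconds (hypothetical helper)."""

    def __init__(self, retention_s=24 * 3600):
        self.retention_s = retention_s
        self._points = deque()  # (timestamp, value), assumed appended in time order

    def add(self, ts, value):
        self._points.append((ts, value))
        self._evict(now=ts)

    def _evict(self, now):
        # Drop points that have aged out of the window.
        cutoff = now - self.retention_s
        while self._points and self._points[0][0] < cutoff:
            self._points.popleft()

    def since(self, ts):
        """Return the points with timestamp >= ts."""
        return [(t, v) for t, v in self._points if t >= ts]


window = RecentWindow(retention_s=60)
for t in range(0, 150, 15):      # one point every 15 s
    window.add(t, value=t * 2)
print(len(window.since(0)))      # 5 points remain inside the 60 s window
```

Eviction happens on write, so memory use is bounded by the ingest rate times the retention period.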
The Consequences of DIRT? 
Tolerance to noise 
• Not a System of Record 
• “Real-time” decisions 
• Drop (some) individual data points rather than be late
• Applies to metrics, not events
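One way to read this policy is as a filter at ingest time: metric points that are already too late get dropped, events never do. The sketch below assumes a 15 s budget (the upper end of the end-to-end latency quoted earlier); `Item`, `should_process` and `MAX_LAG_S` are hypothetical names.

```python
from dataclasses import dataclass

MAX_LAG_S = 15  # assumed latency budget, from the 5-15 s end-to-end figure


@dataclass
class Item:
    kind: str         # "metric" or "event"
    timestamp: float  # when the data was produced
    payload: object


def should_process(item, now, max_lag_s=MAX_LAG_S):
    """Return False for metric points that are already too late to be useful."""
    if item.kind == "event":
        return True                       # events are never dropped
    return (now - item.timestamp) <= max_lag_s


now = 1000.0
fresh = Item("metric", now - 5, 0.42)
stale = Item("metric", now - 60, 0.40)
late_event = Item("event", now - 60, "deploy finished")
print([should_process(i, now) for i in (fresh, stale, late_event)])
# [True, False, True]
```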
[Figure: a spectrum from “Noise but no Latency” to “Latency but no Noise”. Cross here? Or here?]
DataIntensive-RT
The Consequences of DIRT? 
Storage 
• Business Cycles 
• Retention Policy > Business Cycle 
• E.g. retail, education: 12 months
• Elastic Storage 
• !CAPEX
The Consequences of DIRT? 
Latency 
• Datadog, a data exploration app for people 
• Looking for patterns 
• Ideal: 300 ms round-trip 
• Access patterns for long-term data? 
• Storage trade-off: precompute oft-used properties 
• Run-time Trade-off: want longer timespan, get lower resolution 
• != RRD
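The run-time trade-off (longer timespan, lower resolution) can be sketched as choosing a bucket width from the requested span and averaging into it. This is an illustration only: the 300-point cap and the doubling rule are assumptions, and the averaging happens at read time rather than through any precomputation.

```python
def pick_resolution(timespan_s, max_points=300):
    """Pick a bucket width (seconds) so a query returns at most `max_points` buckets."""
    width = 15  # finest resolution, matching the 15 s reporting interval
    while timespan_s / width > max_points:
        width *= 2
    return width


def downsample(points, width):
    """Average (timestamp, value) points into buckets of `width` seconds."""
    buckets = {}
    for ts, value in points:
        start = int(ts // width) * width
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


hour = pick_resolution(3600)            # 15: 240 points fit under the cap
week = pick_resolution(7 * 24 * 3600)   # 3840: coarser buckets for a longer span
points = [(t, float(t)) for t in range(0, 120, 15)]
print(hour, week, downsample(points, 60))
# 15 3840 [(0, 22.5), (60, 82.5)]
```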
Aggregate (BASE, DIRT)
• Constant data influx
• Large data sets
Look for Patterns (BASE)
• On-demand visualization
• Background data analysis
Watch & Share (DIRT)
• Real-time updates
• On-the-fly data analysis
Datadog = DIRT + BASE + a tiny bit of ACID
How It All Fits Together
The Mullet 
All SQL in front, NoSQL party in the back
Actual Stack
Choices, choices 
• 5 axes 
• Volume of Data 
• Latency 
• Ops: wake-up-in-the-middle-of-the-night factor 
• Dev: community & tools 
• Cost as in “a function of X”
Choosing Elastic Storage
Durable, Large-Scale Storage
• Postgres
  • Itemized data points in a time series are useless
  • BLOB management not fun
• Mongo
  • Durability in question in 2010
• SciDB
  • Very very early
• Our pick: Cassandra
• (Riak)
Cassandra: Volume of Data 
• 100s of hosts, 150TB at FB in 2010 
• Easy to distribute data, durable quorum writes
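For readers unfamiliar with quorum writes, the arithmetic is short: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest write when the write and read replica counts together exceed the replication factor. The replication factor of 3 below is an illustrative assumption, not a figure from the talk.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(rf, write_replicas, read_replicas):
    """True when every read set overlaps every write set (W + R > RF)."""
    return write_replicas + read_replicas > rf


rf = 3  # illustrative replication factor
w = r = quorum(rf)
print(quorum(rf), reads_see_latest_write(rf, w, r))   # 2 True
print(reads_see_latest_write(rf, 1, 1))               # False
```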
Cassandra: Latency 
• < 10ms on writes 
• reads more variable (on EC2)* 
* More on this in a bit
Cassandra: Ops 
• Release Engineering too aggressive 
• ~10 releases since 1/2011 on 0.7 branch 
• Good resilience to node loss in the later 0.7 versions 
• Annoying idiosyncrasies (cassandra.yaml, predictability of disk use)
Cassandra: Dev 
• Bizarre nomenclature (rows, columns... families?) 
• Cumbersome data access 
• Limited semantics if you are used to SQL
• Good libraries
Cassandra: Cost 
• Ops time 
• I/O limits raised by increasing number of nodes 
• Thereby increasing costs
Riak 
• Prototyped out of spite for Cassandra 0.7[0123] 
• We ♡ Erlang 
• Great folks 
• But Cassandra pain subsided, priorities shifted. 
• git merge datadog/riak did not happen
Choosing In-Mem
In-memory DB 
• We started with Redis 
• Then we stopped looking :)
Redis 
• Volume of Data 
• Limited by available RAM, easy partitioning in our case 
• Latency 
• << 5 ms, dominated by network 
• Ops 
• Low-maintenance, stable, predictable, replicated, boringly rock-solid 
• Dev 
• Brilliant, clear docs, simple protocol, oft-used native data structures 
• Cost 
• ~ cost of RAM on EC2
Choosing a SQL Data Store
General-purpose data store 
• We ♡ SQL 
• Oracle 
• Postgres
Oracle in numbers (line items in thousands of dollars)
• base license: 47.5
• clustered db: 23
• replication: 10
• partitioning: 11.5
• analytics: 23
• in-mem cache: 23
• total: $138,000
  • for 2 cores
  • + 22% annual support
  • Just in licenses...
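The line items add up as stated once read as thousands of dollars. A small sketch of the arithmetic, assuming the 22% support fee is charged yearly on the full license total (the slide does not say what base it applies to):

```python
# License line items from the slide, in thousands of US dollars.
LICENSES_K = {
    "base license": 47.5,
    "clustered db": 23,
    "replication": 10,
    "partitioning": 11.5,
    "analytics": 23,
    "in-mem cache": 23,
}
SUPPORT_RATE = 0.22   # annual support, as a fraction of license cost

total_k = sum(LICENSES_K.values())
support_k_per_year = total_k * SUPPORT_RATE


def cost_after_years_k(years):
    """Licenses plus `years` of support, in $K (assumes support applies to the full total)."""
    return total_k + years * support_k_per_year


print(total_k)                          # 138.0
print(round(support_k_per_year, 2))     # 30.36
print(round(cost_after_years_k(3), 2))  # 229.08
```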
General-purpose data store 
• Oracle 
• Postgres
Postgres 
• Volume of Data 
• High GBs, Low TBs 
• Latency 
• 10-100 ms after EXPLAIN ANALYZE 
• Ops 
• Low-maintenance, stable, predictable, replicated, boringly rock-solid 
• Dev 
• Well understood by (a certain class of) engineers 
• Cost, a function of storage latency
Not forgetting... 
• VoltDB 
• RAM-based, potentially a match for our DIRTy parts 
• Stored procedures, an acquired taste 
• Home-grown data stores (soon)
• Rainbird 
• ...
The Data Mullet 
• All open-source, good if you’re ready to dive into the code
• $0 CAPEX 
• All OPEX on EC2
The Data Mullet on EC2 
Structural Weakness: I/O latency at moderate throughputs
One “bad” Cassandra query
Clogging the I/O pipes on EC2 
Maximum Average Wait: up to 670 ms 
Maximum Service Time: up to 5 ms 
While writing 100 MB/s 
and reading 30 MB/s
Another “Bad” Query
             DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
03:35:02 PM  dev8-80   380     24000       5.7        62        47    130    1.3     47
03:35:02 PM  dev8-96   370     24000       5.6        63        46    120    1.2     45
03:35:02 PM  dev8-112  380     24000       5.5        63        46    120    1.2     46
03:35:02 PM  dev8-128  380     24000       7.2        63        56    150    1.3     50
• tps: transfers per second (consumer HD: ~75 tps; SSD: 1-30 Ktps)
• rd_sec/s: read throughput in sectors/s (total: 46 MB/s)
• await: average wait in ms
• svctm: average service time in ms
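The totals quoted with these numbers can be reproduced from the raw columns. The sketch below converts sectors to bytes (sar reports 512-byte sectors) and, as a sanity check, estimates queue length and utilization from tps, await and svctm; the estimates come out close to the avgqu-sz and %util columns, though the two formulas are a simplification and not how sar itself computes them.

```python
SECTOR_BYTES = 512  # sar/iostat report sectors of 512 bytes

# (device, tps, rd_sec/s, await ms, svctm ms) copied from the sar output above
rows = [
    ("dev8-80", 380, 24000, 130, 1.3),
    ("dev8-96", 370, 24000, 120, 1.2),
    ("dev8-112", 380, 24000, 120, 1.2),
    ("dev8-128", 380, 24000, 150, 1.3),
]

read_mib_s = sum(rd for _, _, rd, _, _ in rows) * SECTOR_BYTES / 2**20
print(round(read_mib_s, 1))   # 46.9, the "Total: 46 MB/s" on the slide

for dev, tps, _, await_ms, svctm_ms in rows:
    queue_len = tps * await_ms / 1000        # Little's law: arrivals/s x time in system
    util_pct = tps * svctm_ms / 1000 * 100   # share of each second spent servicing I/O
    print(dev, round(queue_len, 1), round(util_pct, 1))
```

The gap between await (over 100 ms) and svctm (about 1 ms) is the point of the slide: requests spend nearly all their time queued, not being served.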
Mitigation of I/O issues?
Cassandra: I/O Mitigation 
• More nodes, more RAM, more partitions, more parallelism 
• $$$
Postgres: I/O Mitigation 
• Scale up to a point 
• Replicate 
• Move to bare metal => $$$
• A well-trodden but difficult path
Better yet... 
• Less dependency on low-latency, durable storage 
• Move more data to RAM (Redis) 
• Archive immutable data 
• S3/Cloudfront?
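Read as a placement policy, these bullets amount to routing data by age and mutability. The sketch below is purely hypothetical: the 24-hour hot window, the tier names and `pick_tier` are assumptions made for illustration, not a description of the actual system.

```python
HOT_WINDOW_S = 24 * 3600   # assumed: "recent" means the last 24 hours


def pick_tier(age_s, immutable):
    """Route a chunk of data to a storage tier (hypothetical policy)."""
    if age_s <= HOT_WINDOW_S:
        return "ram"        # e.g. Redis
    if immutable:
        return "archive"    # e.g. S3
    return "durable"        # e.g. Cassandra or Postgres


print(pick_tier(600, immutable=False))            # ram
print(pick_tier(7 * 24 * 3600, immutable=True))   # archive
print(pick_tier(7 * 24 * 3600, immutable=False))  # durable
```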
A digression: 
Your Very Own Chaos Monkey 
• Instances go bye-bye 
• https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/741224
• Instances go bye-bye, take 2 (high load) 
• https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/708920
Takeaway 
• By mixing and matching open-source SQL (PG) and NoSQL (Redis, Cassandra), Datadog has been able to get up and running quickly and simply, with a “$0” down payment on infrastructure.
http://datadoghq.com
@datadoghq 
Alexis Lê-Quôc @alq
