A Journey In The Public Clouds
       With Datadog

    Alexis Lê-Quôc (Product Guy) at Datadog
             IASA New York Chapter
                 June 28th, 2011
What I’m going to talk about
 ‣What we do and for whom
 ‣The kind of data we deal with
 ‣Our architecture
 ‣Our architecture in a public cloud (AWS)
 ‣What we learned
 ‣Q+A
SaaS Platform for
Aggregation, Correlation, Collaboration
           For Dev & Ops




            What do we do?
The Mess
[Diagram: Dev team, Ops team, and Applications linked to Servers and Devices, IAAS / PAAS, Usage Analytics, Issue Resolution, Monitoring, Hosting, CDNs, Cap. Planning, SDLC support, and Asset Mgmt by flows of metrics, events, billing, insights, changes, choices, and advice + feedback]

Too many data streams, too many silos
Too many choices to make, too often
Only getting worse as SaaS silos multiply
Separate Dev and Ops teams, looking at separate data streams

Data-driven decision making in IT is rarely happening.
Too slow, too expensive, requires too much discipline.
We Simplify
Datadog to the rescue
[Diagram: many data sources flow into Datadog, which delivers visibility to each role]

Data in: system metrics, quality metrics, SaaS data, capacity metrics, usage analytics, cloud billing, code metrics, config changes, IaaS pricing, perf. data, vendors info, curated metadata

key metrics        to Alice Dev
recommendations    to Bob Ops
business metrics   to Charlie CEO

 Aggregation   Correlation        Collaboration
Concretely
Aggregation
https://app.datad0g.com/dash/dash/1000#/date_range/1308057152698-1308143552698
                                                                                 Correlation
Collaboration
What Architecture For
 What Kind Of Data?
Events          Metrics
User comments   Unique visitors
Alert           Load
Build           Transaction duration
Batch job       etc.
Taxonomy
Atomicity                                    Basically
Consistency                                  Available
Isolation                                    Soft-state
Durability                                   Eventual
                                             consistency

e.g. SQL DBs                                 e.g. DNS


           CLASSICS
        http://en.wikipedia.org/wiki/Eventual_consistency
Data
      Intensive
      Real
      Time

      e.g. real-time web


NEWCOMER
Bryan Cantrill: http://dtrace.org/resources/bmc/DIRT.pdf
Aggregation (BASE, DIRT)
Constant data influx
Large data sets

              Correlation (BASE)
              On-demand visualization
              Background data analysis

                             Collaboration (DIRT)
                             Real-time updates
                             On-the-fly data analysis

  Datadog = DIRT + BASE + a tiny bit of ACID
How It All Fits Together
    http://www.flickr.com/photos/tom-margie/1253798184/
Architecture
   Simplified
[Diagram: simplified architecture, with components stamped BASE, DIRT, and ACID]
The Environment
4 Dimensions
Compute
Storage
Network
Management
ON-PREMISE TRAITS
http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
Compute
Fast
Inelastic

Storage
Fast
Centralized
Redundant

Network
Fast
Localized

Management
People-based
Full access
CLOUD TRAITS
Compute
Slow
Elastic

Storage
Slow
Jittery
Maybe durable
Low memory

Network
“Fast”
Geo-distributed

Management
No bare-metal
“Magic” API
What We Have
   Found
Network
Network
Layer 2: Virtual Domain
Layer 3: Crude Edge Filtering
Layer 7: Crude Load Balancing
DNS
CDN

It Works!
Storage
[Chart: latency (vertical axis) versus capacity (horizontal axis). Latency and capacity both increase from Redis up to Amazon S3.]

DIRT   Redis
ACID   PostgreSQL
BASE   Apache Cassandra
BASE   Amazon S3

Constraints overlaid on the chart, from low to high capacity: low memory, jitter, limited throughput, latency.
Low Memory
 http://aws.amazon.com/ec2/#instance
Jittery, Limited Throughput
          Network Block Storage (EBS)

  https://app.datad0g.com/dash/dash/1032#/date_range/1308608717016-1309213517016
              DEV        tps      rd_sec/s   wr_sec/s   avgrq-sz   avgqu-sz   await    svctm   %util
03:35:02 PM   dev8-80    375.95   23614.08   5.70       62.83      47.21      125.58   1.26    47.34
03:35:02 PM   dev8-96    373.63   23749.65   5.64       63.58      45.55      121.91   1.22    45.72
03:35:02 PM   dev8-112   375.28   23693.47   5.52       63.15      45.52      121.22   1.23    46.31
03:35:02 PM   dev8-128   375.31   23721.57   7.19       63.22      56.00      148.96   1.34    50.35

rd_sec/s: read throughput in sectors/s (total: 368Mb/s)
await: average wait in ms
svctm: average service time in ms

   Limited Throughput In Numbers
                      RAID 0 EBS Volumes, m1.large instances
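
As a rough check on the total quoted above, the per-volume read rates can be summed and converted to bytes. This is only a sketch: it assumes 512-byte sectors, and the variable names are illustrative rather than taken from the talk.

```python
# Read rates (rd_sec/s) for the four EBS volumes listed in the table above.
rd_sec_per_s = {
    "dev8-80": 23614.08,
    "dev8-96": 23749.65,
    "dev8-112": 23693.47,
    "dev8-128": 23721.57,
}

SECTOR_BYTES = 512  # assumption: one sector is 512 bytes

total_sectors = sum(rd_sec_per_s.values())
total_bytes = total_sectors * SECTOR_BYTES
total_mib_per_s = total_bytes / 2**20        # mebibytes per second
total_mibit_per_s = total_bytes * 8 / 2**20  # mebibits per second

print(f"{total_sectors:.2f} sectors/s")
print(f"{total_mib_per_s:.1f} MiB/s, {total_mibit_per_s:.0f} Mibit/s")
```

Under that assumption the four volumes together read roughly 46 MiB/s, or about 370 Mibit/s, which is in the same ballpark as the 368Mb/s shown on the slide.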
Software RAID           Limited by slowest
RAID 0                  volume
Offsite backups

Ephemeral volumes
And Offsite backups     Complexity
                        Recovery Time Objective
Streaming replication   Recovery Point Objective
S3 backups

Database Service        Trust
MySQL/Oracle RDS        RDS Outage 2 months ago

              Some Tricks
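
The "limited by slowest volume" caveat for RAID 0 can be illustrated with a deliberately simplified model: each stripe spans every volume, so the array moves at the pace of its slowest member. The function name and the throughput figures below are hypothetical, chosen only to show the effect.

```python
def raid0_throughput(volume_mb_per_s):
    """Estimate striped throughput under a simplified model.

    Every stripe touches all volumes, so each volume effectively
    contributes only as much as the slowest one can deliver.
    """
    if not volume_mb_per_s:
        raise ValueError("need at least one volume")
    return len(volume_mb_per_s) * min(volume_mb_per_s)


# Hypothetical per-volume throughput in MB/s.
steady = [12.0, 12.0, 12.0, 12.0]
one_slow_volume = [12.0, 12.0, 12.0, 4.0]

print(raid0_throughput(steady))           # 48.0
print(raid0_throughput(one_slow_volume))  # 16.0
```

In this model a single jittery volume drags the whole array down to a third of its nominal rate, which is why striping over networked block storage only helps so much.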
Network Block Storage
    Is The Dark Side

    Bait For Enterprise
       Customers


Hard Problem For
 Cloud Providers
Don’t rely on networked block storage
Small data sets only if you have to

Don’t trust data-at-rest
Copy, replicate, back up

Do use S3 if you can
Object semantics a limitation
Slow but durable



       Some Do’s And Don’ts
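
One way to act on "Don't trust data-at-rest" is to verify every copy against a checksum of the source. The sketch below uses only the Python standard library and local files; the function names are hypothetical, and a real setup would presumably copy to a remote target such as S3 rather than a local path.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_verification(source: Path, destination: Path) -> str:
    """Copy source to destination and fail loudly if the copy differs."""
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"checksum mismatch for {destination}")
    return actual
```

Storing the returned digest next to the backup makes it possible to re-verify the copy later, instead of assuming it is still intact.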
Compute
[Chart: “Performance” (vertical axis) versus number of nodes (horizontal axis)]
ACID nodes: scale up, then shard
BASE and DIRT nodes: add more nodes
Don’t rely on scale-ups
Low memory a hard limit for DBs
Noisy neighbors
Individual performance poor and jittery

Scale out
First scale up
Then Shard
Parallelize across machines
Vector-processing via GPUs


       Some Do’s And Don’ts
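
The "Then Shard" step can be sketched as a stable hash that routes each key to one of N nodes. This is a minimal illustration, not the scheme Datadog used: the function name and the example keys are hypothetical, and plain modulo hashing reshuffles most keys whenever the shard count changes.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count).

    A cryptographic hash is used instead of Python's built-in hash(),
    which is salted per process and so not stable across machines.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical metric keys spread across four shards.
for key in ("web01.cpu.load", "web02.cpu.load", "db01.disk.await"):
    print(key, "->", shard_for(key, 4))
```

Because the mapping depends only on the key and the shard count, any machine can compute where a key lives without consulting a central directory.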
Management
An API for everything
Compute
Storage
Network
Management
Do use the AWS APIs
Almost like magic
Rich libraries
Ever expanding

Do use tools
e.g. Chef, Puppet, cfengine, etc.
Datadog

Do Kill and Respawn
Low-level debugging impossible
Instance creation is cheap

Some Do’s And Don’ts
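
"Kill and Respawn" amounts to replacing unhealthy instances instead of debugging them in place. The sketch below shows the shape of such a loop. The `Instance` type and the `terminate` and `launch` callables are hypothetical stand-ins for whatever cloud API or tool is in use; they are not AWS API calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Instance:
    instance_id: str
    healthy: bool


def kill_and_respawn(
    instances: Iterable[Instance],
    terminate: Callable[[str], None],
    launch: Callable[[], Instance],
) -> List[Instance]:
    """Return a fleet in which every unhealthy instance has been replaced."""
    fleet = []
    for instance in instances:
        if instance.healthy:
            fleet.append(instance)
        else:
            # No in-place debugging: drop the instance and start a fresh one.
            terminate(instance.instance_id)
            fleet.append(launch())
    return fleet
```

Passing the cloud operations in as callables keeps the replacement policy separate from any particular provider and makes it easy to exercise with fakes.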
New Rules
New Tools
New Playbook

Same Fundamentals
Questions!

http://datadoghq.com
      twitter: @alq

