Leon Torres 
October 15, 2014
Web Startup Challenges 
• Low-friction development 
• Hodgepodge of technologies 
• Hodgepodge of infrastructures 
• Legacy support 
• Constant migrations and upgrades 
• Bottom line: 
High rate of change and no time to check!
Distributed monitoring
A Gordian Knot 
• How utilized is our Hadoop cluster? 
• How utilized is our DC? 
• Are all of our services running correctly? 
• Is our latency OK at every layer in the stack? 
• Someone changed something: were there any 
negative ripple effects? 
• Are we hitting any scaling issues?
A Network Knot 
• Our products live on the internet 
• Our data centers are global 
– Some of them are virtual 
• Network effects are a fact of life 
– Network partitions 
– Latency makes information late 
– Noise is natural and frequent 
– Data just goes missing 
– High availability compounds the problem
Distributed monitoring
– Richard W. Hamming
Solution Design 
• Hypothesize existence of 
system state: 
a time-varying stream of state components 
• Build it by measuring our systems in toto 
• Stream all measurements to one place 
• Gain insight by inspecting this stream 
computationally and ad-hoc
Separation of Concerns 
• State collection 
• State computation 
• State visualization
Collecting State 
• Define a state event ADT capturing: 
– Host 
– Service 
– State 
– Timestamp 
– Any additional key/value fields 
• Find something to collect it
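
One way to picture the state event ADT above is as a small immutable record. The sketch below is a hypothetical Python rendering, not Riemann's actual wire format; the class name `StateEvent`, the `attributes` field and the `to_dict` helper are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class StateEvent:
    """Hypothetical state event: one observation of one service on one host."""
    host: str
    service: str
    state: str
    timestamp: float = field(default_factory=time.time)
    # Any additional key/value fields travel alongside the core ones.
    attributes: Dict[str, str] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, object]:
        """Flatten the event into a single mapping, ready to serialize."""
        flat: Dict[str, object] = dict(self.attributes)
        # Core fields win if an attribute happens to reuse one of their names.
        flat.update(host=self.host, service=self.service,
                    state=self.state, time=self.timestamp)
        return flat


event = StateEvent("web01", "http", "ok", timestamp=1413331200.0,
                   attributes={"dc": "us-east"})
print(event.to_dict())
```

Keeping the core fields fixed and pushing everything else into free-form key/value pairs means new monitors can add detail without changing the record shape.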
Riemann 
• Riemann accepts state events as a stream 
• Riemann indexes the stream, provides stream 
processing facilities and some alerting tools 
• Also provides downstream pipes: 
– Unix domain sockets 
– Web sockets 
– Graphite stream comes free 
– Create your own
Internal State Relays 
• Poll third party monitors for state 
• Map to Riemann events 
• Send to Riemann 
• Fill in holes with custom monitors 
– Hadoop jobs, load balancer state, etc. 
• Foundation in place to know everything about 
our global DC state
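
A relay of this kind mostly amounts to a translation function. The sketch below assumes a Nagios-style check result has already been polled into a dict; the input keys (`host_name`, `service_description`, `status_code`, `plugin_output`) and the output layout are hypothetical, chosen only to illustrate the mapping step. The numeric codes 0 to 3 are the standard Nagios plugin return codes.

```python
import time
from typing import Any, Dict, Optional

# Nagios plugin return codes (0-3) are standard; the state strings on the
# right are a convention chosen for this sketch.
NAGIOS_STATES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def nagios_to_event(check: Dict[str, Any],
                    now: Optional[float] = None) -> Dict[str, Any]:
    """Map one polled check result (hypothetical shape) to an event dict."""
    code = int(check["status_code"])
    return {
        "host": check["host_name"],
        "service": check["service_description"],
        # Anything outside 0-3 is treated as unknown rather than guessed at.
        "state": NAGIOS_STATES.get(code, "unknown"),
        "time": now if now is not None else time.time(),
        "description": check.get("plugin_output", ""),
        "tags": ["nagios"],
    }


polled = {"host_name": "hdp01", "service_description": "hdfs",
          "status_code": 2, "plugin_output": "DISK CRITICAL"}
print(nagios_to_event(polled, now=100.0))
```

Tagging each event with its origin keeps it possible to tell relayed state apart from state produced by custom monitors once everything shares one stream.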
Network Monitors 
• Static monitors around the world 
– Constantly check HTTP state of services 
• Poll third party monitors (Pingdom, etc.) 
• Deduce network state from aggregate streams 
• Detect outages from user perspective 
• Can extend with PhantomJS to get a Gomez-like 
waterfall and do whatever we want!
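
Deducing network state from aggregate streams can be sketched as a vote across vantage points. The function below is an illustration under assumptions: the check keys (`service`, `monitor`, `ok`), the `degraded` label and the 50% threshold are all invented here, not taken from the deployed monitors.

```python
from typing import Dict, Iterable, List


def deduce_network_state(checks: Iterable[Dict[str, object]],
                         outage_fraction: float = 0.5) -> Dict[str, str]:
    """Classify each service from HTTP checks taken at several vantage points.

    Each check is a dict with hypothetical keys "service", "monitor", "ok".
    """
    results: Dict[str, List[bool]] = {}
    for check in checks:
        results.setdefault(str(check["service"]), []).append(bool(check["ok"]))

    states: Dict[str, str] = {}
    for service, oks in results.items():
        failed = oks.count(False) / len(oks)
        if failed == 0:
            states[service] = "ok"
        elif failed < outage_fraction:
            # Only a minority of monitors fail: more likely a network
            # problem near those monitors than a service outage.
            states[service] = "degraded"
        else:
            states[service] = "outage"
    return states


checks = [
    {"service": "www", "monitor": "tokyo", "ok": True},
    {"service": "www", "monitor": "london", "ok": True},
    {"service": "www", "monitor": "virginia", "ok": True},
    {"service": "api", "monitor": "tokyo", "ok": False},
    {"service": "api", "monitor": "london", "ok": True},
    {"service": "api", "monitor": "virginia", "ok": True},
    {"service": "cdn", "monitor": "tokyo", "ok": False},
    {"service": "cdn", "monitor": "london", "ok": False},
    {"service": "cdn", "monitor": "virginia", "ok": True},
]
print(deduce_network_state(checks))
```

A single failing monitor says little on its own; comparing monitors against each other is what separates a regional network problem from an outage that users everywhere would see.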
Demo Time 
• Ad hoc demo 
– Grep the stream 
– Quickly analyze state of disk utilization 
• Hadoop global state 
– It just pipes Nagios data! 
• Network monitoring demo 
– Let’s combine Pingdom + network monitors 
– And iterate! Awesome dashboard
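
"Grep the stream" can be approximated in a few lines once events arrive as text. The sketch below assumes one JSON-encoded event per line, which is a framing invented for this example rather than the format used in the demo; the field names `service` and `metric` are likewise assumptions.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def grep_stream(lines: TextIO, service_substring: str,
                min_metric: float = 0.0) -> Iterator[Dict[str, object]]:
    """Yield events whose service matches and whose metric is high enough.

    Assumes one JSON-encoded event per line (a made-up framing).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if (service_substring in event.get("service", "")
                and event.get("metric", 0.0) >= min_metric):
            yield event


sample = io.StringIO(
    '{"host": "db01", "service": "disk /", "metric": 0.93}\n'
    '{"host": "db02", "service": "disk /", "metric": 0.41}\n'
    '{"host": "db01", "service": "cpu", "metric": 0.99}\n'
)
full_disks = [e["host"] for e in grep_stream(sample, "disk", min_metric=0.9)]
print(full_disks)
```

Because the filter is a generator over any line-oriented text source, the same function works on a saved capture or on a live socket wrapped as a file object.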
Distributed Gotchas 
• Riemann can scale, but with some nasty surprises 
– Events on a TCP connection are processed serially 
– If event rate gets too high, stream gets saturated 
and backs up into OS network buffers, then into 
Netty’s unbounded buffers. This ultimately 
starves heap and crashes Riemann. 
– Solution is to use large connection pools at the 
clients that push events
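
The connection-pool remedy can be sketched as a round-robin dispatcher on the client side. This is a minimal illustration, not the Riemann client's API: `RoundRobinPool`, `FakeConnection` and the `connect` factory are hypothetical names, and the fake connection only records payloads so the example runs without a server.

```python
import itertools
import threading
from typing import Callable, List, Protocol


class Connection(Protocol):
    def send(self, payload: bytes) -> None: ...


class RoundRobinPool:
    """Spread events over several connections so no single stream saturates."""

    def __init__(self, connect: Callable[[], Connection], size: int = 8) -> None:
        if size < 1:
            raise ValueError("pool size must be at least 1")
        self._connections: List[Connection] = [connect() for _ in range(size)]
        self._next = itertools.cycle(range(size))
        self._lock = threading.Lock()

    def send(self, payload: bytes) -> None:
        # Choose the next connection under a lock so threads rotate fairly.
        with self._lock:
            conn = self._connections[next(self._next)]
        conn.send(payload)


class FakeConnection:
    """Stand-in for a real TCP client; records what it was asked to send."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


fakes: List[FakeConnection] = []


def connect() -> FakeConnection:
    fakes.append(FakeConnection())
    return fakes[-1]


pool = RoundRobinPool(connect, size=3)
for i in range(6):
    pool.send(b"event-%d" % i)
print([len(f.sent) for f in fakes])
```

Since events on one TCP connection are processed serially, spreading them over several connections gives the server a chance to work on them in parallel. A real pool would also need per-connection locking and reconnect handling, which this sketch leaves out.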
Distributed Gotchas 
• Network outages and partitions are difficult 
– Riemann must not go down 
– Riemann must deal with split-brain 
• Highly available SRE solution planned 
– Virtual IP, heartbeat (similar to LB solution) 
• Riemann servers in separate locations 
– End up with two masters on partition => double 
the alerts but at least we get something
Are we cutting the knot?


Editor's Notes

  • #4: At no point can we sit down and sift through our architecture and say this situation is an error and that situation is ok. We cannot just classify things like that because they become defunct within a month and sometimes within days. OK, we can do it for certain things, but for most application-level stuff we have no way to do it. We have to somehow monitor *everything* and figure out how we can know what went wrong from that. Note that this requires us to be experts at every level of the system, as Bilke covered in the last presentation.
  • #5: Let’s take a look at some things we may want to know. These are some gnarly, but super important questions.
  • #6: Our life is complicated by the distributed nature of our systems, so we need to ensure that whatever solution we have takes into account the network.
  • #7: Here are some existing solutions we have tried over the years.
  • #8: However, our experience is that these do not work. They each solve different problems, sometimes very well, but they all fail to answer the knotty questions about the overall system. We have to drill down into many of these applications to get an idea of what the heck is going on. I don’t know about you, but I’m getting log-in fatigue whenever a problem happens. And the situation is getting worse with all these pay-ware hosted third party solutions. So is there a better way? We need to clear our minds of these approaches and look at the fundamental problem from a fresh perspective.
  • #9: If we really get back to the basics, we’re talking information theory, computer science, really thinking about the problem as far down as we need to go. And I’m not being academic. Hamming’s quote illustrates a highly pragmatic wisdom despite his heavily mathematical work. It’s also quite on topic: We will take a deep look at what we’re really trying to do here, to come up with some solution design that considers our desire for insight and how we can piece it together numerically from our chaotic mess of systems, people and processes.
  • #10: Each of the existing tools we just swiped off the table purported to yield insight from some data, but they somehow failed to tell us what we need to know: the state of our system. Let’s look at a solution design that involves the so-called state of our system. (read slide) Now much of this was motivated by a project called Riemann, which was designed by a physics nerd. In science, when you model something, you choose to represent the system as state vectors in some convenient topological space, and then you run gnarly computations to see if the model matches reality. This is a powerful approach that has consistently yielded great insights on the nature of the universe. We will repeat this process here because hey, our computer systems are a subset of the universe.
  • #11: This makes it straightforward to implement, debug, scale and maintain.
  • #13: The point of all this is to be able to operate on the stream as needed. Note that you don’t need to write Clojure code to do this; you can simply open a socket and stream it into Python or whatever. Later on there will be demos that I cobbled together using JavaScript over WebSockets.
  • #14: What about monitoring the data center? It turns out we don’t have to re-invent the wheel. Monitoring systems like Nagios and New Relic each have an API that allows us to poll the state and map it to Riemann-friendly events. This is great because we can leverage existing expertise in these monitoring systems and get a huge return right off the bat.
  • #15: Pingdom is great, but it lacks some features, such as telling us what the network state is in general. We can deduce the network state by creating our own series of monitors. This also gives us a platform to replicate the latency waterfall for web pages as done by Gomez and Akamai.
  • #16: Demo time
  • #17: I wrote something about the Riemann Java client being lousy. The network monitors have to reconnect on timeout, but that wasn’t supported. So I implemented my own connection logic with one TCP connection and ended up getting burned rather nicely by this. So now I have to contribute to the Java client or roll my own. Exciting stuff!
  • #19: It’s too soon to say, but I have been using this system during recent outages and it’s starting to look quite useful. We can expect a follow-up to cover the problem of insight and whether this kind of streaming state processor helps at all. There are some additional preliminary and exciting ideas that I haven’t covered here. It’s shaping up to be an interesting body of work. Finally, who would have known: monitoring seems like such a dry topic, until you realize it’s actually very deep.