Distributed monitoring

Web Startup Challenges
• Low-friction development
• Hodgepodge of technologies
• Hodgepodge of infrastructures
• Legacy support
• Constant migrations and upgrades
• Bottom line:
High rate of change and no time to check!

A Gordian Knot
• How utilized is our Hadoop cluster?
• How utilized is our DC?
• Are all of our services running correctly?
• Is our latency OK at every layer in the stack?
• Someone changed something: were there any
negative ripple effects?
• Are we hitting any scaling issues?

A Network Knot
• Our products live on the internet
• Our data centers are global
– Some of them are virtual
• Network effects are a fact of life
– Network partitions
– Latency makes information late
– Noise is natural and frequent
– Data just goes missing
– High availability compounds the problem

Solution Design
• Hypothesize the existence of system state:
a time-varying stream of state components
• Build it by measuring our systems in toto
• Stream all measurements to one place
• Gain insight by inspecting this stream, both
computationally and ad hoc

Separation of Concerns
• State collection
• State computation
• State visualization

Collecting State
• Define a state event ADT capturing:
– Host
– Service
– State
– Timestamp
– Any additional key/value fields
• Find something to collect it
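
The event ADT listed above could be sketched as a small immutable record. This is only an illustration of the five fields named on the slide; the class name, field names, and example values are assumptions, not taken from any particular collector.

```python
from dataclasses import dataclass, field
from typing import Dict
import time


@dataclass(frozen=True)
class StateEvent:
    """One observation of one service on one host at one point in time."""
    host: str
    service: str
    state: str  # e.g. "ok", "warning", "critical" (assumed vocabulary)
    timestamp: float = field(default_factory=time.time)  # Unix seconds
    attributes: Dict[str, str] = field(default_factory=dict)  # extra key/value fields


# Example event; the host, service, and attribute names are made up.
event = StateEvent(
    host="web-01",
    service="disk /",
    state="ok",
    timestamp=1_700_000_000.0,
    attributes={"used_pct": "42"},
)
```

Keeping the record frozen means an event cannot have its fields reassigned once it enters the stream, which makes it safer to hand the same object to several consumers.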

Riemann
• Riemann accepts state events as a stream
• Riemann indexes the stream, provides stream
processing facilities and some alerting tools
• Also provides downstream pipes:
– Unix domain sockets
– Web sockets
– Graphite stream comes free
– Create your own
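
To make "indexes the stream" concrete, here is a toy index that keeps only the most recent event per (host, service) pair and lets you query it by state. It is a sketch of the idea in plain Python, not Riemann's implementation or API; the class, method names, and event dictionaries are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]


class LatestStateIndex:
    """Keeps the newest event seen for each (host, service) pair."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], Event] = {}

    def update(self, event: Event) -> None:
        key = (event["host"], event["service"])
        current = self._latest.get(key)
        # Ignore events that arrive late and are older than what we hold.
        if current is None or event["time"] >= current["time"]:
            self._latest[key] = event

    def where(self, state: str) -> List[Event]:
        """Return the current events whose state matches."""
        return [e for e in self._latest.values() if e["state"] == state]


index = LatestStateIndex()
stream = [
    {"host": "db-01", "service": "mysql", "state": "ok", "time": 10},
    {"host": "db-01", "service": "mysql", "state": "critical", "time": 20},
    {"host": "web-01", "service": "http", "state": "ok", "time": 15},
]
for e in stream:
    index.update(e)

critical = index.where("critical")  # only db-01/mysql is currently critical
```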

Internal State Relays
• Poll third party monitors for state
• Map to Riemann events
• Send to Riemann
• Fill in holes with custom monitors
– Hadoop jobs, load balancer state, etc.
• Foundation in place to know everything about
our global DC state
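
A relay of this kind is mostly a mapping function plus a poll/send loop. The sketch below assumes a Nagios-style monitor whose checks carry the standard plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the input field names and the `poll` and `send` callables are placeholders for whatever the real monitor and event client provide.

```python
from typing import Any, Callable, Dict, Iterable

# Standard Nagios plugin return codes.
NAGIOS_STATES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def to_event(check: Dict[str, Any]) -> Dict[str, Any]:
    """Map one polled check result (assumed shape) to a state event."""
    return {
        "host": check["host_name"],
        "service": check["service_description"],
        "state": NAGIOS_STATES.get(check["return_code"], "unknown"),
        "time": check["last_check"],
        "description": check.get("plugin_output", ""),
    }


def relay_once(
    poll: Callable[[], Iterable[Dict[str, Any]]],
    send: Callable[[Dict[str, Any]], None],
) -> int:
    """Poll the third-party monitor once and forward every result."""
    count = 0
    for check in poll():
        send(to_event(check))
        count += 1
    return count


# Stand-ins for a real monitor and a real event client.
def fake_poll():
    return [
        {"host_name": "hadoop-nn", "service_description": "hdfs",
         "return_code": 2, "last_check": 100, "plugin_output": "DISK FULL"},
        {"host_name": "lb-01", "service_description": "haproxy",
         "return_code": 0, "last_check": 101},
    ]


sent = []
forwarded = relay_once(fake_poll, sent.append)
```

Passing `poll` and `send` in as callables keeps the mapping testable without a running monitor or event server.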

Network Monitors
• Static monitors around the world
– Constantly check HTTP state of services
• Poll third party monitors (Pingdom, etc.)
• Deduce network state from aggregate streams
• Detect outages from user perspective
• Can extend with PhantomJS to get Gomez-like
waterfalls and do whatever we want!
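
"Deduce network state from aggregate streams" can be as simple as comparing what the different vantage points report for the same service. In the sketch below, each event's host is assumed to be the monitor location that made the check, and the three verdicts ("ok", "degraded", "outage") are an illustrative classification rather than anything the slides specify.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def deduce_network_state(events: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Combine per-monitor HTTP check results into one verdict per service."""
    # service -> {monitor location -> last reported state}
    by_service: Dict[str, Dict[str, str]] = defaultdict(dict)
    for e in events:
        by_service[e["service"]][e["host"]] = e["state"]

    verdicts = {}
    for service, states in by_service.items():
        failing = sum(1 for s in states.values() if s != "ok")
        if failing == 0:
            verdicts[service] = "ok"
        elif failing == len(states):
            # Every vantage point fails: users everywhere see an outage.
            verdicts[service] = "outage"
        else:
            # Only some fail: likely a partition or a regional problem.
            verdicts[service] = "degraded"
    return verdicts


checks = [
    {"host": "monitor-us", "service": "www", "state": "ok"},
    {"host": "monitor-eu", "service": "www", "state": "critical"},
    {"host": "monitor-us", "service": "api", "state": "critical"},
    {"host": "monitor-eu", "service": "api", "state": "critical"},
]
verdicts = deduce_network_state(checks)
```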

Demo Time
• Ad hoc demo
– Grep the stream
– Quickly analyze state of disk utilization
• Hadoop global state
– It just pipes Nagios data!
• Network monitoring demo
– Let’s combine Pingdom + network monitors
– And iterate! Awesome dashboard

Distributed Gotchas
• Riemann can scale, but there are some nasty surprises
– Events on a TCP connection are processed serially
– If event rate gets too high, stream gets saturated
and backs up into OS network buffers, then into
Netty’s unbounded buffers. This ultimately
starves heap and crashes Riemann.
– Solution is to use large connection pools at the
clients that push events
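
On the client side, the connection-pool workaround amounts to spreading events over several connections so that no single serially processed connection carries the whole rate. The following is a minimal round-robin sketch; `connect` is a hypothetical factory for whatever client connection is in use, and `RecordingConnection` is an in-memory stand-in so the example runs without a server.

```python
import itertools
import threading
from typing import Any, Callable, Dict, List


class RoundRobinPool:
    """Spreads events across a fixed set of connections, one at a time."""

    def __init__(self, connect: Callable[[], Any], size: int) -> None:
        if size < 1:
            raise ValueError("pool size must be at least 1")
        self.connections = [connect() for _ in range(size)]
        self._cycle = itertools.cycle(self.connections)
        self._lock = threading.Lock()  # protects the shared iterator

    def send(self, event: Dict[str, Any]) -> None:
        with self._lock:
            conn = next(self._cycle)
        conn.send(event)


class RecordingConnection:
    """Stand-in for a real TCP connection; just remembers what was sent."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, Any]] = []

    def send(self, event: Dict[str, Any]) -> None:
        self.sent.append(event)


pool = RoundRobinPool(RecordingConnection, size=4)
for i in range(8):
    pool.send({"host": "web-01", "service": "req", "metric": i})

per_connection = [len(c.sent) for c in pool.connections]
```

With 8 events over 4 connections, each connection carries 2. Events on different connections may be processed out of order relative to each other, so this trade only makes sense when events carry their own timestamps.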

Distributed Gotchas (continued)
• Network outages and partitions are difficult
– Riemann must not go down
– Riemann must deal with split-brain
• Highly available SRE solution planned
– Virtual IP, heartbeat (similar to LB solution)
• Riemann servers in separate locations
– A partition leaves us with two masters => double
the alerts, but at least we get something
