Production Ready Microservices at Scale

Rajeev N B

@rBharshetty

GO-JEK

Transport, Logistics, Hyperlocal
delivery and Payments
• 18 Products
• 1m+ Drivers
• 500+ Microservices
• 15k+ Cores
• 2 Cloud Providers
• 6 Data centers
• 100+ million bookings per month

Agenda
• What are Production Ready Microservices ?

• Why do we need them ?

• How do we build them ?

What is Production Ready ?
• Stable

• Reliable

• Scalable

• Performant

• Fault Tolerant

• Monitored

• Prepared for Catastrophe

• Secure

Why ?
• Goal is Availability

• Trust services you put in Production

• Organisational Sprawl

• Increased Development Velocity

• Technical Sprawl

Production
Readiness Checklist

Linting/
Formatting
• Statically analyse code for linting and
formatting errors

• Follow Ruby Style Guide

• Helps maintain Code sanity

• Use tools like Rubocop
• `rubocop -a` to autofix certain
errors


Code Smells
• Statically analyse code for code
smells like Long Parameter list,
Cyclomatic Complexity,
Uncommunicative name etc
• Easy to change codebases

• Use tools like Reek
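
As an illustration of one smell named above, here is a minimal sketch of a Long Parameter List and one common way to remove it. The method and struct names (`create_booking_v1`, `BookingRequest`, `create_booking`) are hypothetical and not from the talk.

```ruby
# Before: six positional arguments are easy to pass in the wrong order.
# This is the shape that a Long Parameter List check is meant to flag.
def create_booking_v1(customer_id, driver_id, pickup, dropoff, fare, currency)
  "#{customer_id}:#{driver_id}:#{pickup}->#{dropoff}:#{fare}#{currency}"
end

# After: related arguments are grouped into a small value object
# (hypothetical name), so the method takes a single parameter.
BookingRequest = Struct.new(:customer_id, :driver_id, :pickup, :dropoff,
                            :fare, :currency, keyword_init: true)

def create_booking(request)
  "#{request.customer_id}:#{request.driver_id}:" \
    "#{request.pickup}->#{request.dropoff}:#{request.fare}#{request.currency}"
end
```

Callers now name each field explicitly, which makes later changes to the argument set less error-prone.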

Secure coding
• Statically analyse code for Security
vulnerabilities
• Identify security issues like XSS,
CSRF, SQL Injection etc

• Use tools like Brakeman
• Identify issues with various
confidence levels

#1. Code Quality
• Make it part of your development
process

• Also deployment pipeline

• Helps keep bugs from creeping into
the code

• Helps achieve Stability and
Reliability of the microservice

• Also helps achieve security

Unit/Integration Testing
• Unit tests test a unit/function (Individual component)

• Integration tests help in testing components working together

• Contract testing helps test interactions of your micro service with its
dependencies
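
A unit test in this sense could look like the sketch below, which uses Minitest. `FareCalculator` and its fare constants are hypothetical, chosen only to give the test something to exercise.

```ruby
require "minitest/autorun"

# Hypothetical unit under test: computes a fare from a distance.
class FareCalculator
  BASE_FARE = 2000
  PER_KM = 1500

  def fare_for(distance_km)
    raise ArgumentError, "distance must be non-negative" if distance_km.negative?

    BASE_FARE + (PER_KM * distance_km)
  end
end

# Unit tests exercise the component in isolation, with no network or database.
class FareCalculatorTest < Minitest::Test
  def test_adds_per_km_charge_to_base_fare
    assert_equal 5000, FareCalculator.new.fare_for(2)
  end

  def test_rejects_negative_distance
    assert_raises(ArgumentError) { FareCalculator.new.fare_for(-1) }
  end
end
```

An integration test would follow the same structure but wire `FareCalculator` to its real collaborators instead of testing it alone.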

Load Testing
• Expose the service to higher traffic loads to observe its behaviour

• Makes sure that the microservice is ready to accept increased traffic in
future

• Helps identify scalability challenges and bottlenecks

• Helps in determining SLA of a service

• Load testing at GO-JEK using Gatling

Chaos Testing
• Helps determine Unknown Unknowns
• Type of Resiliency testing

• Break systems to understand behaviour

• At GO-JEK: manual

• Automated: Simian Army

#2. Testing
• Can be automated

• Make it part of your development
process

• Also deployment pipeline

• Builds confidence in code

• Helps with Fault tolerance,
Scalability, Performance, Stability
and Reliability

Timeouts/Retries
• Stop waiting for an answer after certain time - Timeout
• Timeout - Fail Fast

• Retry the request after failure

• Retry - Succeed eventually

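Examples - 'net/http'
require 'net/http'

http = Net::HTTP.new(host, port)
http.open_timeout = 1 # seconds to wait for the connection to open
http.read_timeout = 1 # seconds to wait for each read from the socket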
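
The slide shows timeouts only. A retry wrapper around such a call might be sketched as follows. `with_retries`, its parameters, and the exponential backoff policy are assumptions for illustration, not something described in the talk.

```ruby
require "net/http"

# Runs the block and retries it when a Net::HTTP timeout is raised.
# Waits base_delay, then 2x, then 4x ... between attempts (exponential backoff).
# The sleeper argument exists so tests can avoid real sleeping.
def with_retries(max_attempts: 3, base_delay: 0.1, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Net::OpenTimeout, Net::ReadTimeout
    raise if attempt >= max_attempts # give up and surface the timeout

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage, assuming `http` is configured with timeouts as above:
#   response = with_retries { http.get("/health") }
```

Retries are only safe for idempotent requests, and the attempt limit keeps a failing dependency from being flooded with repeated calls.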

Circuit Breaker
• Circumvent calls to a dependency when calls keep failing

• Fall back to a default response
• Hystrix is a well known implementation

• jruby-hystrix used internally at GO-JEK (Yet to be open sourced)

module CustomerClient
  module Hystrix
    class HttpCommand < ::Hystrix::HystrixBaseCommand
      SERVER_DOWN_HTTP_STATUS_CODES = [500, 501, 502, 503, 504]
      def run
        http_response = nil
        begin
          http_response = request_uri.send(request_method, request_params, request_headers) do |callback|
            callback.on(400..511) do |response|
              Wrest.logger.warn("<- #{request_method} #{request_uri}")
              Wrest.logger.warn("code: #{response.code}: body: #{response.deserialised_body}")
            end
          end
          if SERVER_DOWN_HTTP_STATUS_CODES.include?(http_response.code.to_i)
            raise "Server Error with response code #{http_response.code.to_i}"
          end
          build_response(http_response)
        rescue => e
          raise HttpCommandException.new(request_uri, request_headers, request_params,
                                         http_response), e.message
        end
      end
    end
  end
end
jruby-hystrix
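
Since jruby-hystrix is not public, the pattern itself can be sketched in plain Ruby. The `CircuitBreaker` class below is a hypothetical, single-threaded illustration of the idea and is not the jruby-hystrix or Hystrix API.

```ruby
# Minimal circuit breaker sketch: after failure_threshold consecutive failures
# the circuit opens and calls go straight to the fallback. After reset_timeout
# seconds one trial call is allowed through again.
class CircuitBreaker
  def initialize(failure_threshold: 3, reset_timeout: 5,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @clock = clock # injectable so tests can control time
    @failures = 0
    @opened_at = nil
  end

  # Runs the block, or returns fallback.call if the circuit is open or the block raises.
  def call(fallback:)
    return fallback.call if open?

    begin
      result = yield
    rescue StandardError
      record_failure
      return fallback.call
    end
    reset
    result
  end

  def open?
    return false if @opened_at.nil?

    @clock.call - @opened_at < @reset_timeout
  end

  private

  def reset
    @failures = 0
    @opened_at = nil
  end

  def record_failure
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
  end
end
```

While the circuit is open the dependency receives no traffic, which gives it room to recover. A production implementation would also need thread safety and metrics.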

#3. Resiliency
Patterns
• Protect a system from failures of its
dependencies

• Protect dependency when it has
started to fail

• Fail Fast and Succeed Eventually

• Helps with Fault tolerance,
Scalability, Stability and Reliability

Logging
• Describes state of the micro service

• Helps debugging root cause of an issue

• Structured logging

• Logging needs to be scalable, available and easily accessible (Log
centralisation)

• Logging in Ruby
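
One way to get structured logs from Ruby's standard `Logger` is a JSON formatter, sketched below. The helper name `build_json_logger`, the field names, and the `customer-service` label are assumptions for illustration.

```ruby
require "json"
require "logger"
require "time"

# Builds a Logger that writes one JSON object per line, which log
# centralisation tools can parse without custom patterns.
def build_json_logger(io = $stdout, service: "customer-service")
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    # Hash messages become top-level fields; anything else goes under :message.
    payload = msg.is_a?(Hash) ? msg : { message: msg.to_s }
    base = { level: severity, time: time.utc.iso8601(3), service: service }
    JSON.generate(base.merge(payload)) + "\n"
  end
  logger
end

logger = build_json_logger
logger.info({ event: "booking.created", booking_id: 42, duration_ms: 18 })
```

Each line carries the same base fields, so entries from many services can be filtered by `service`, `level`, or `event` once centralised.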

Instrumentation in Code
• Microservice Key metrics (Latencies, errors, api responses etc) -
newrelic_rpm

• StatsD metrics (statsd-instrument) - Custom metrics
• Exception and error tracking metrics (sentry-raven)

StatsD
Initialisation:
StatsD.backend = case Rails.env
  when 'production', 'staging'
    StatsD::Instrument::Backends::UDPBackend.new("local:8125", :datadog)
  when 'test'
    StatsD::Instrument::Backends::NullBackend.new
  else
    StatsD::Instrument::Backends::LoggerBackend.new(StatsD.logger)
end
Counter:
StatsD.increment('customer.login.device.ratelimit.count')
Gauge:
StatsD.gauge('booking.queued', 12, sample_rate: 1.0)

Dashboards
• Capture Key metrics

• USE (Utilisation, Saturation, Errors)

• RED (Request Rate, Error Rate, Duration)

• CPU, RAM, database connections, latencies etc

• Should represent Health of a service

• Grafana/NewRelic dashboards at GO-JEK
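
To make the RED metrics concrete, the sketch below derives them from a list of request samples. `RequestSample`, `red_metrics`, the 5xx error rule, and the choice of the 95th percentile for duration are assumptions for illustration. In practice a metrics backend computes these values.

```ruby
# Hypothetical record of one handled request.
RequestSample = Struct.new(:status, :duration_ms, keyword_init: true)

# Computes Rate (requests/second), Errors (share of 5xx responses)
# and Duration (95th percentile latency) over one time window.
def red_metrics(samples, window_seconds:)
  return { rate: 0.0, error_rate: 0.0, p95_ms: nil } if samples.empty?

  durations = samples.map(&:duration_ms).sort
  errors = samples.count { |sample| sample.status >= 500 }
  # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
  p95_index = [(durations.size * 95).fdiv(100).ceil - 1, 0].max

  {
    rate: samples.size.fdiv(window_seconds),
    error_rate: errors.fdiv(samples.size),
    p95_ms: durations[p95_index]
  }
end
```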

Alerting
• Alert when failures are detected or key metrics change

• Alerts should be actionable
• Answered by people on call/support
• Helps protect availability of systems before they go down

• Alerting at GO-JEK

#4. Observability
• Provides insights into how a service
is behaving and its overall health
• Includes 3 main categories:
Logging, Monitoring and
Distributed Tracing
• Helps in quick incident response
during outages

• Helps maintain Stability of the
service

CI/CD
• Increased Deployment Velocity - 30+ deploys a day

• Bad deployments are a major cause of downtime

• Rollout - Staging -> Production
• Rolling Deploys - Incremental
• Think of Rollbacks at every deployment

Canary
• Deploy the new change to single node or subset of total nodes

• Monitor key metrics for certain duration

• Promote Canary to Production if all good or else Rollback
• Helps minimise impact of bad deployments in Production
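
The promote-or-rollback step above can be sketched as a comparison of canary and baseline error rates. `CanaryStats`, `canary_decision`, and the thresholds are hypothetical. A real check would usually look at more metrics, such as latency and saturation.

```ruby
# Hypothetical request and error counts collected during the canary window.
CanaryStats = Struct.new(:requests, :errors, keyword_init: true) do
  def error_rate
    requests.zero? ? 0.0 : errors.fdiv(requests)
  end
end

# Returns :wait until the canary has seen enough traffic, then :promote if its
# error rate is within max_increase of the baseline, otherwise :rollback.
def canary_decision(baseline:, canary:, max_increase: 0.01, min_requests: 100)
  return :wait if canary.requests < min_requests

  canary.error_rate <= baseline.error_rate + max_increase ? :promote : :rollback
end
```

The `min_requests` guard avoids promoting or rolling back on too small a sample.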

#5. CI/CD
• Lint/Test/Build/Package/Deploy
steps

• Canaries help minimise the impact
of bad deployments

• Rollbacks to minimise impact
• Helps Provide Stability and
Reliability

Always being
available is the Goal

More …
• Documentation

• Security

• Capacity Planning

• Outages and Incident Response (Runbooks)

Future work
• Implementing Production Readiness checklist on the ground

• Review process for micro services

• Production Readiness score

• Automating the checklist

References
• Production Ready Microservices - Susan Fowler

• Google SRE Book

• Microservices Standardisation

• SLA

Thanks for listening
Questions ?
@rBharshetty
@gojektech
