Production Ready
Microservices
at scale
Rajeev N B

@rBharshetty 

GO-JEK
Transport, Logistics, Hyperlocal
delivery and Payments
• 18 Products
• 1m+ Drivers
• 500+ Microservices
• 15k+ Cores
• 2 Cloud Providers
• 6 Data centers
• 100+ million bookings per month
Agenda
• What are Production Ready Microservices ?

• Why do we need them ?

• How do we build them ?
What is Production Ready ?
• Stable

• Reliable

• Scalable

• Performant

• Fault Tolerant

• Monitored

• Prepared for Catastrophe

• Secure
Why ?
• Goal is Availability

• Trust services you put in Production

• Organisational Sprawl

• Increased Development Velocity

• Technical Sprawl
How ?
Production
Readiness Checklist
#1. Code Quality
Linting/
Formatting
• Statically analyse code for linting and formatting errors

• Follow Ruby Style Guide

• Helps maintain Code sanity

• Use tools like Rubocop
• `rubocop -a` to autofix certain
errors
Code Smells
• Statically analyse code for code
smells like Long Parameter list,
Cyclomatic Complexity,
Uncommunicative name etc
• Keeps the codebase easy to change

• Use tools like Reek
Secure coding
• Statically analyse code for Security
vulnerabilities
• Identify security issues like XSS,
CSRF, SQL Injection etc

• Use tools like Brakeman
• Reports each issue with a confidence level
#1. Code Quality
• Make it part of your development
process

• Also deployment pipeline

• Helps avoid bugs creeping into the code

• Helps achieve Stability and Reliability of the microservice

• Also helps achieve security
#2. Testing
Unit/Integration Testing
• Unit tests test a unit/function (Individual component)

• Integration tests help in testing components working together

• Contract testing helps test interactions of your microservice with its dependencies
Contract Testing
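A consumer-side contract check can be sketched in plain Ruby. This is only an illustration of the idea: the `CUSTOMER_CONTRACT` hash, its field names, and the `contract_violations` helper are hypothetical, and real projects usually rely on a dedicated contract-testing tool rather than hand-rolled checks.

```ruby
# Hypothetical contract: the fields and types this consumer relies on
# in a customer response. Names are illustrative only.
CUSTOMER_CONTRACT = {
  'id'    => Integer,
  'name'  => String,
  'phone' => String
}.freeze

# Returns a list of human-readable violations; the list is empty when
# the response satisfies the contract. Extra fields are allowed.
def contract_violations(response, contract)
  contract.each_with_object([]) do |(field, type), errors|
    if !response.key?(field)
      errors << "missing field: #{field}"
    elsif !response[field].is_a?(type)
      errors << "#{field} should be #{type}, got #{response[field].class}"
    end
  end
end

good = { 'id' => 42, 'name' => 'Asha', 'phone' => '+62000', 'extra' => true }
bad  = { 'id' => '42', 'name' => 'Asha' }

puts contract_violations(good, CUSTOMER_CONTRACT).inspect
puts contract_violations(bad, CUSTOMER_CONTRACT).inspect
```

Running the same contract against the provider's real responses in its own pipeline is what would turn this from a consumer-side check into a shared contract.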
Load Testing
• Expose the service to higher traffic loads to observe its behaviour

• Makes sure the microservice is ready to accept increased traffic in future

• Helps identify scalability challenges and bottlenecks

• Helps in determining SLA of a service

• Load testing at GO-JEK using Gatling
Chaos Testing
• Helps determine Unknown Unknowns
• Type of Resiliency testing

• Break systems to understand behaviour

• At GO-JEK: manual

• Automated: Simian Army
#2. Testing
• Can be automated

• Make it part of your development
process

• Also deployment pipeline

• Builds confidence in code

• Helps with Fault tolerance,
Scalability, Performance, Stability
and Reliability
#3. Resilience Patterns
Timeouts/Retries
• Stop waiting for an answer after a certain time - Timeout
• Timeout - Fail Fast

• Retry the request after failure

• Retry - Succeed eventually
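Examples - 'net/http'
require 'net/http'
# host and port identify the dependency being called
http = Net::HTTP.new(host, port)
http.open_timeout = 1 # seconds to wait for the connection to open
http.read_timeout = 1 # seconds to wait for each read from the socket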
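The retry half of the pattern could look roughly like the sketch below. The `with_retries` helper, its parameters, and the exponential backoff policy are assumptions made for illustration and are not taken from the talk. Retrying is generally only safe for idempotent requests.

```ruby
# Retries the block when one of the listed errors is raised, sleeping
# with exponential backoff between attempts. All names and defaults
# here are illustrative assumptions.
def with_retries(attempts: 3, base_delay: 0.1, on: [StandardError],
                 sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts # give up: re-raise the last error
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

calls = 0
result = with_retries(attempts: 3, base_delay: 0.01) do |attempt|
  calls += 1
  raise IOError, 'flaky dependency' if attempt < 3
  'ok'
end
puts "#{result} after #{calls} calls"
```

Passing `on: [Net::OpenTimeout, Net::ReadTimeout]` would restrict retries to the timeouts configured above, which keeps "fail fast" for every other kind of error.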
Circuit Breaker
• Short-circuit calls to a dependency once calls start failing

• Return a fallback response instead
• Hystrix is a well known implementation

• jruby-hystrix used internally at GO-JEK (Yet to be open sourced)
Circuit Breaker
module CustomerClient
  module Hystrix
    # Wraps an HTTP call to a dependency in a Hystrix command.
    class HttpCommand < ::Hystrix::HystrixBaseCommand
      SERVER_DOWN_HTTP_STATUS_CODES = [500, 501, 502, 503, 504]

      def run
        http_response = nil
        begin
          http_response = request_uri.send(request_method, request_params, request_headers) do |callback|
            # Log every 4xx and 5xx response.
            callback.on(400..511) do |response|
              Wrest.logger.warn("<- #{request_method} #{request_uri}")
              Wrest.logger.warn("code: #{response.code}: body: #{response.deserialised_body}")
            end
          end
          # Treat server errors as failures by raising.
          if SERVER_DOWN_HTTP_STATUS_CODES.include?(http_response.code.to_i)
            raise "Server Error with response code #{http_response.code.to_i}"
          end
          build_response(http_response)
        rescue => e
          raise HttpCommandException.new(request_uri, request_headers, request_params,
                                         http_response), e.message
        end
      end
    end
  end
end
jruby-hystrix
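Since jruby-hystrix is not publicly available, the mechanism itself can be illustrated with a minimal plain-Ruby sketch. This is not Hystrix and not GO-JEK's implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are assumptions, and the sketch is not thread-safe.

```ruby
# Minimal circuit breaker sketch. Opens after `threshold` consecutive
# failures and rejects calls until `reset_after` seconds have passed,
# after which the next call is allowed through as a trial.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, reset_after: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold
    @reset_after = reset_after
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @reset_after
  end

  # Runs the block; uses the fallback when the circuit is open or the call fails.
  def call(fallback:)
    return fallback.call(OpenError.new('circuit open')) if open?

    begin
      result = yield
      @failures = 0
      @opened_at = nil
      result
    rescue StandardError => e
      @failures += 1
      @opened_at = @clock.call if @failures >= @threshold
      fallback.call(e)
    end
  end
end

now = 0.0 # fake clock so the example runs instantly
breaker = CircuitBreaker.new(threshold: 2, reset_after: 10, clock: -> { now })
fallback = ->(error) { "fallback (#{error.message})" }

3.times { puts breaker.call(fallback: fallback) { raise 'customer service down' } }
now = 11.0
puts breaker.call(fallback: fallback) { 'customer profile' }
```

The third call never reaches the dependency, which is the point of the pattern: the caller fails fast and the struggling dependency gets room to recover.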
#3. Resilience Patterns
• Protect a system from failures of its dependencies

• Protect a dependency once it has started to fail

• Fail Fast and Succeed Eventually

• Helps with Fault tolerance,
Scalability, Stability and Reliability
#4. Observability
Logging
• Describes the state of the microservice

• Helps debug the root cause of an issue

• Structured logging

• Logging needs to be scalable, available and easily accessible (Log
centralisation)

• Logging in Ruby
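Structured logging can be sketched with Ruby's standard `Logger` and a JSON formatter. The `build_json_logger` helper, the `service` field, and the event names are hypothetical. The idea is one JSON object per line so that a central log store can index the fields.

```ruby
require 'json'
require 'logger'
require 'stringio'
require 'time'

# Builds a Logger that writes one JSON object per line.
# Hash messages are merged into the payload; anything else is
# stored under the :message key.
def build_json_logger(io, service:)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    payload = msg.is_a?(Hash) ? msg : { message: msg.to_s }
    base = { time: time.utc.iso8601(3), level: severity, service: service }
    JSON.generate(base.merge(payload)) + "\n"
  end
  logger
end

logger = build_json_logger($stdout, service: 'customer-service')
logger.info(event: 'login', customer_id: 42, duration_ms: 18)
logger.warn('rate limit reached')
```

Writing to standard output and letting the platform ship the lines to a central store is one common arrangement; the talk does not say which one GO-JEK uses.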
Instrumentation in Code
• Microservice key metrics (latencies, errors, API responses etc) - newrelic_rpm

• StatsD metrics (statsd-instrument) - Custom metrics
• Exception and error tracking metrics (sentry-raven)
StatsD
Initialisation:
`StatsD.backend = case Rails.env
  when 'production', 'staging'
    StatsD::Instrument::Backends::UDPBackend.new('local:8125', :datadog)
  when 'test'
    StatsD::Instrument::Backends::NullBackend.new
  else
    StatsD::Instrument::Backends::LoggerBackend.new(StatsD.logger)
  end`
Counter:
`StatsD.increment('customer.login.device.ratelimit.count')`
Gauge:
`StatsD.gauge('booking.queued', 12, sample_rate: 1.0)`
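For intuition about what a UDP backend sends, the sketch below formats metrics in the plain-text StatsD datagram format using only the standard library. It does not use statsd-instrument; `statsd_line`, `send_metric`, the `booking.create.duration` metric, and the host and port are assumptions for illustration.

```ruby
require 'socket'

# Formats one metric in the StatsD plain-text line format:
#   <name>:<value>|<type>[|@<sample_rate>]
# where type is 'c' (counter), 'g' (gauge) or 'ms' (timer).
def statsd_line(name, value, type, sample_rate: 1.0)
  line = "#{name}:#{value}|#{type}"
  line += "|@#{sample_rate}" if sample_rate < 1.0
  line
end

# Fire-and-forget send over UDP. Host and port are assumptions;
# this helper is defined for completeness and not called below.
def send_metric(line, host: '127.0.0.1', port: 8125)
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
ensure
  socket&.close
end

puts statsd_line('customer.login.device.ratelimit.count', 1, 'c')
puts statsd_line('booking.queued', 12, 'g')
puts statsd_line('booking.create.duration', 35, 'ms', sample_rate: 0.5)
```

Because UDP is connectionless, emitting a metric does not block on or fail with the metrics agent, which is why instrumentation of this kind is cheap enough to leave on in production.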
Dashboards
• Capture Key metrics

• USE (Utilisation, Saturation, Errors)

• RED (Request Rate, Error Rate, Duration)

• CPU, RAM, database connections, latencies etc

• Should represent Health of a service

• Grafana/NewRelic dashboards at GO-JEK
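The RED numbers behind such a dashboard can be sketched as a small computation over request records. The `Request` struct, the choice to count only 5xx responses as errors, and the nearest-rank percentile are assumptions made for this illustration.

```ruby
# One record per handled request; the field names are illustrative.
Request = Struct.new(:status, :duration_ms)

# Nearest-rank percentile over a list of numbers.
def percentile(values, pct)
  return nil if values.empty?

  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# RED summary for the requests observed during `window_seconds`.
# Only 5xx responses are counted as errors in this sketch.
def red_summary(requests, window_seconds:)
  errors = requests.count { |r| r.status >= 500 }
  durations = requests.map(&:duration_ms)
  {
    request_rate: requests.size.to_f / window_seconds,
    error_rate: requests.empty? ? 0.0 : errors.to_f / requests.size,
    p50_ms: percentile(durations, 50),
    p95_ms: percentile(durations, 95)
  }
end

window = [
  Request.new(200, 12), Request.new(200, 15), Request.new(500, 230),
  Request.new(200, 18), Request.new(404, 9),  Request.new(200, 22),
  Request.new(503, 410), Request.new(200, 14), Request.new(200, 16),
  Request.new(200, 20)
]
puts red_summary(window, window_seconds: 60)
```

Reporting a high percentile next to the median matters here: the two slow failing requests barely move `p50_ms` but dominate `p95_ms`.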
NewRelic - Service level
Grafana - System level
Uptime Dashboard
Alerting
• Alert when failures are detected or key metrics change

• Alerts should be actionable
• Answered by the people on call/support
• Helps protect the availability of systems before they go down

• Alerting at GO-JEK
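One way to keep alerts actionable is to fire only on a sustained breach rather than on a single spike. The sketch below assumes a hypothetical `ThresholdAlert` class and a `booking.error_rate` metric; it is not a description of GO-JEK's alerting setup.

```ruby
# Fires only when the metric breaches the threshold for
# `for_windows` consecutive evaluation windows.
class ThresholdAlert
  def initialize(name:, threshold:, for_windows: 3)
    @name = name
    @threshold = threshold
    @for_windows = for_windows
    @breaches = 0
  end

  # Returns an alert message when firing, nil otherwise.
  def evaluate(value)
    @breaches = value > @threshold ? @breaches + 1 : 0
    return nil if @breaches < @for_windows

    "ALERT #{@name}: #{value} > #{@threshold} for #{@breaches} windows"
  end
end

alert = ThresholdAlert.new(name: 'booking.error_rate', threshold: 0.05, for_windows: 2)
[0.01, 0.08, 0.02, 0.09, 0.12].each do |error_rate|
  message = alert.evaluate(error_rate)
  puts message if message
end
```

The isolated spike to 0.08 is ignored, and only the two consecutive breaches at the end produce a page.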
#4. Observability
• Provides insights into how a service
is behaving and its overall health
• Includes 3 main categories:
Logging, Monitoring and
Distributed Tracing
• Helps in quick incident response during outages

• Helps maintain Stability of the
service
#5. CI/CD Pipelines
CI/CD
• Increased Deployment Velocity - 30+ deploys a day

• Bad deployments are a major cause of downtime

• Rollout - Staging -> Production
• Rolling Deploys - Incremental
• Think of Rollbacks at every deployment
Stable Deployment Pipeline
Canary
• Deploy the new change to single node or subset of total nodes

• Monitor key metrics for certain duration

• Promote the canary to Production if the metrics look healthy, otherwise roll back
• Helps minimise impact of bad deployments in Production
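The promote-or-rollback step can be sketched as a comparison between canary and baseline metrics. The `CanaryMetrics` struct, the chosen metrics, and both tolerances are assumptions for illustration; the talk does not specify which metrics or thresholds GO-JEK uses.

```ruby
# Metrics collected for the baseline fleet and for the canary node(s)
# over the same observation window. Field names are illustrative.
CanaryMetrics = Struct.new(:error_rate, :p95_latency_ms, keyword_init: true)

# Returns [:promote, []] when the canary stays within tolerance of the
# baseline, otherwise [:rollback, reasons].
def canary_decision(baseline:, canary:, max_error_delta: 0.01, max_latency_ratio: 1.2)
  reasons = []
  if canary.error_rate - baseline.error_rate > max_error_delta
    reasons << 'error rate regression'
  end
  if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    reasons << 'latency regression'
  end
  reasons.empty? ? [:promote, reasons] : [:rollback, reasons]
end

baseline = CanaryMetrics.new(error_rate: 0.002, p95_latency_ms: 120)
healthy  = CanaryMetrics.new(error_rate: 0.003, p95_latency_ms: 125)
broken   = CanaryMetrics.new(error_rate: 0.04,  p95_latency_ms: 310)

p canary_decision(baseline: baseline, canary: healthy)
p canary_decision(baseline: baseline, canary: broken)
```

Comparing against the baseline over the same window, rather than against fixed numbers, keeps the check meaningful when overall traffic or latency shifts during the rollout.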
Canary
#5. CI/CD
• Lint/Test/Build/Package/Deploy
steps

• Canaries minimise the impact of bad deployments

• Rollbacks to minimise impact
• Helps Provide Stability and
Reliability
In Conclusion …
Standardisation is
the Goal
Always being
available is the Goal
More …
• Documentation

• Security

• Capacity Planning

• Outages and Incident Response (Runbooks)
Future work
• Implementing Production Readiness checklist on the ground

• Review process for microservices

• Production Readiness score

• Automating the checklist
References
• Production Ready Microservices - Susan Fowler

• Google SRE Book

• Microservices Standardisation

• SLA
Thanks for listening
Questions ?
@rBharshetty
@gojektech
