A State of Xen
Chaos Monkey & Cassandra
Who we are
Jean-Sebastien Jeannotte – JS
Senior Software Engineer
Platform Automation Engineering
jjeannotte@netflix.com
@jsjeannotte
http://www.linkedin.com/in/jsjeannotte
Nir Alfasi
Senior Software Engineer
Platform Automation Engineering
alfasi@netflix.com
@niralfasi
http://www.linkedin.com/in/alfasin
Christos Kalantzis
Director of Engineering
Cloud Database Engineering
Cassandra MVP
ckalantzis@netflix.com
@chriskalan
http://www.linkedin.com/in/christoskalantzis
AWS Re:boot
September 2014, Every AZ
Cassandra Summit 2015 - A State of Xen - Chaos Monkey & Cassandra
Our stack during Re:boot 2014
C*
Priam
C*
Priam
C*
Priam
REST + SSH
Our stack during Re:boot 2014
C*
Priam
C*
Priam
C*
Priam
REST + SSH
Atlas
Atlas
App 1
App 2
Our stack during Re:boot 2014
Disappearing instance?
Launch new instance
All good
Is the C* ring healthy?
Are all instances healthy?
All good
Can we fix automatically?
Replace bad instance
All good
Is there an offline maintenance?
First failure?
Sleep for X minutes and retry
PagerDuty
Is there an offline maintenance?
First failure?
All good
Every 30 min
Our stack during Re:boot 2014
AWS Re:boot
September 2014, Every AZ
Gaps we identified
New direction
New direction – What others are doing
New direction – What we decided to do
C*
Priam
C*
Priam
C*
Priam
Atlas
Atlas
App 1
App 2
New direction – What we learned (principles)
Synchronous vs. Asynchronous
SSH vs. HTTP / REST
What does the future look like?
Check out our https://jobs.netflix.com page for current openings

Cassandra Summit 2015 - A State of Xen - Chaos Monkey & Cassandra

Editor's Notes

  • #2: Building a house of cards on a solid database foundation.
  • #3: I lead Cloud Database Engineering at Netflix. Among other things, we offer C* as a service within Netflix. Feel free to follow me on Twitter or link up on LinkedIn.
  • #4, #5, #6: Introduce the Simian Army. Netflix LOVES chaos. We love it so much that we generate it. Monkey: runs in prod. Kong: an exercise. We run it on most Netflix services, and even on C*.
  • #7: CDE has Chaos Monkey enabled on our C* clusters: at most one node per day, during business hours. Our healthcheck detects the missing instance and replaces it.
  • #8: 218 C* nodes were rebooted. 22 nodes didn't start and were automatically terminated by the AWS internal healthcheck. Our healthcheck identified the missing nodes and automatically remediated the issue. Zero downtime.
  • #9: A bunch of Python/Shell scripts. Jenkins as the job scheduler (HC, node replacements, repairs, upgrades, etc.). On the C* nodes: C* + Priam. Is something missing? Monitoring? OpsCenter?
  • #10: Why not OpsCenter? It didn't exist when Netflix started using C*, and it is redundant in our stack.
  • #11: (Continuation on why not OpsCenter.) Change the slide according to Christos's feedback. Atlas is already a very powerful metrics and alerting tool, and our metric systems add non-C* metrics (app metrics, for example) that help in correlation. Alerts can combine C* and app metrics. How it behaved during the Re:boot: how did the healthcheck behave, how does it work, and how does it react to Chaos Monkey?
  • #12: (Continuation on why not OpsCenter.) Atlas is already a very powerful metrics and alerting tool, and our metric systems add non-C* metrics (app metrics, for example) that help in correlation.
  • #13: (Continuation on why not OpsCenter.) Alerts can combine C* and app metrics.
  • #14: Healthcheck flow: two scenarios are automatically remediated.
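
The healthcheck flow on the slide can be read as a small decision procedure. The sketch below is one possible reading, not the actual Netflix healthcheck: the probe methods on `cluster` are hypothetical names, the branch order is inferred from the order of the slide labels, and the two "offline maintenance / first failure" branches on the slide are collapsed into one.

```python
from enum import Enum


class Outcome(Enum):
    ALL_GOOD = "all good"
    LAUNCHED_NEW_INSTANCE = "launch new instance"
    REPLACED_BAD_INSTANCE = "replace bad instance"
    SLEEP_AND_RETRY = "sleep for X minutes and retry"
    PAGED = "PagerDuty"


def healthcheck(cluster, first_failure):
    """Run one healthcheck pass and return the outcome that was chosen.

    `cluster` is any object exposing the hypothetical probe methods used
    below; the branch order is an assumption read off the slide labels.
    """
    # Scenario 1 (auto-remediated): an instance vanished, e.g. Chaos Monkey.
    if cluster.has_missing_instance():
        cluster.launch_new_instance()
        return Outcome.LAUNCHED_NEW_INSTANCE

    if cluster.ring_healthy() and cluster.all_instances_healthy():
        return Outcome.ALL_GOOD

    # Scenario 2 (auto-remediated): a bad instance that can be replaced.
    if cluster.can_fix_automatically():
        cluster.replace_bad_instance()
        return Outcome.REPLACED_BAD_INSTANCE

    # Planned offline maintenance explains the failure: nothing to do.
    if cluster.in_offline_maintenance():
        return Outcome.ALL_GOOD

    # Give transient problems one chance to clear before paging a human.
    if first_failure:
        return Outcome.SLEEP_AND_RETRY
    return Outcome.PAGED
```

A scheduler would call `healthcheck` every 30 minutes, passing `first_failure=False` on the retry that follows a `SLEEP_AND_RETRY` outcome.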
  • #15: How did the healthcheck behave during Re:boot
  • #16: HC: a big monolith. About 100k lines of Python/Bash scripts. Hard to maintain.
  • #17: Lack of chaining (statefulness: if this job failed run that, else…). Stateless. Lack of native support for TRIGGERING jobs based on events, like listening to SQS queues.
  • #18, #19: High availability: the Jenkins master node is a single point of failure. Long-running processes may crash due to a transient connection issue between the slave and the master.
  • #20: What we learned, and what we decided to focus on (Principles)
  • #21: What others are doing: Facebook (FBAR) / LinkedIn (Nurse) / Dropbox (Naoru)
  • #22, #23: Build our own or adopt an existing solution? We started with our own POC, then decided to go with StackStorm, an event-driven automation platform. Facilitated troubleshooting and event handling. Automated remediation (Discovery example).
  • #24: What we decided to do: new environment. StackStorm description (rules/actions…). Example of the Disk Space Alert. Gap recap.
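
To make the rules/actions idea concrete, here is a minimal, self-contained sketch of event-driven remediation for a disk space alert. It is not StackStorm's API: the `Rule` class, the `dispatch` function, the event name, and the 90% threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    trigger: str                       # event type the rule listens for
    criteria: Callable[[Dict], bool]   # predicate on the event payload
    action: Callable[[Dict], str]      # remediation to run on a match


def dispatch(event_type: str, payload: Dict, rules: List[Rule]) -> List[str]:
    """Run every rule whose trigger and criteria match the event."""
    results = []
    for rule in rules:
        if rule.trigger == event_type and rule.criteria(payload):
            results.append(rule.action(payload))
    return results


def clean_up_disk(payload: Dict) -> str:
    # Placeholder remediation; a real action would free space on the node.
    return f"cleanup requested on {payload['node']}"


RULES = [
    Rule(
        trigger="disk_space_alert",               # hypothetical event name
        criteria=lambda p: p["used_pct"] >= 90,   # assumed threshold
        action=clean_up_disk,
    ),
]
```

Compared with a cron-style scheduler, nothing runs until an event arrives, which addresses the triggering gap noted earlier.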
  • #25: Idempotence (make a stateless system feel like a stateful system). Automation tools need to ensure that you reach a certain state. Example: downloading the C* tarball: first, check the nodetool version.
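
The tarball example can be sketched as a "check first, then act" step. This is a minimal illustration, not the team's code: `installed_version` and `install` are hypothetical callables supplied by the caller (the first could wrap a version query on the node, the second the download and unpack).

```python
from typing import Callable


def ensure_version(desired: str,
                   installed_version: Callable[[], str],
                   install: Callable[[str], None]) -> bool:
    """Bring the node to `desired`; return True only if work was done."""
    if installed_version() == desired:
        return False          # already in the target state: do nothing
    install(desired)          # e.g. download and unpack the tarball
    if installed_version() != desired:
        raise RuntimeError(f"node did not reach version {desired}")
    return True
```

Running the step twice is safe: the second call finds the node already in the desired state and does nothing, so a job that crashed halfway can simply be re-run.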
  • #26: K.I.S.S. - “Simplicity is the ultimate sophistication” (Example: Resumable repairs - make more concise)
  • #27: Prefer HTTP over SSH and Async over Sync
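
One way to picture "async over sync": instead of holding a connection open while a long command runs, submit the job, get an identifier back, and poll for its status. The sketch below assumes hypothetical `submit` and `get_status` callables (each would wrap one short HTTP request to an agent on the node) and made-up status strings.

```python
import time
from typing import Callable


def run_job(submit: Callable[[], str],
            get_status: Callable[[str], str],
            poll_interval: float = 5.0,
            max_polls: int = 120,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Submit a job, then poll until it finishes or polling is exhausted.

    Returns the terminal status ("succeeded" or "failed", assumed names).
    """
    job_id = submit()                  # short request: returns immediately
    for _ in range(max_polls):
        status = get_status(job_id)    # short request: no long-lived session
        if status in ("succeeded", "failed"):
            return status
        sleep(poll_interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

Because every request is short, a dropped connection costs one poll rather than the whole long-running operation.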
  • #28: Retries with Timeouts and exponential back-off
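
A minimal sketch of retries with a per-call timeout and exponential back-off. The parameter names and default values are assumptions for illustration; `call` is any function that accepts a timeout in seconds and raises on failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[float], T],
          attempts: int = 5,
          timeout: float = 10.0,
          base_delay: float = 1.0,
          max_delay: float = 60.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call(timeout)` until it succeeds or `attempts` is exhausted.

    The delay doubles after each failure (base_delay, 2x, 4x, ...) and is
    capped at `max_delay`. The last exception is re-raised on give-up.
    """
    for attempt in range(attempts):
        try:
            return call(timeout)
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Adding random jitter to each delay is a common refinement when many clients retry at once; it is left out here to keep the sketch short.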
  • #29: Serving fallbacks. Examples: a dynamic property service with hard-coded defaults; Netflix personalized recommendations falling back to default recommendations.
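
The dynamic property example can be sketched as follows. The property name, its default value, and the `fetch_remote` callable are all hypothetical; the point is only that the caller always gets a usable value.

```python
from typing import Callable, Dict

# Hard-coded defaults shipped with the code (name and value are made up).
DEFAULTS: Dict[str, str] = {"repair.parallelism": "1"}


def get_property(name: str, fetch_remote: Callable[[str], str]) -> str:
    """Return the dynamic value, or the hard-coded default when the
    property service call fails for any reason."""
    try:
        return fetch_remote(name)
    except Exception:
        return DEFAULTS[name]
```

If the property service is down, automation keeps running with conservative defaults instead of failing outright.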
  • #31: Audit trail: use Logstash to index data into Elasticsearch for trend analysis. Mention that we already use Logstash at Netflix, but we want to plug it into our automated remediation system.
  • #32: Metadata / statistics / long-term metrics. Use trend analysis to be proactive instead of reactive: use disk usage to predict when we need to increase the cluster size, with automated resizing.
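
As an illustration of proactive trend analysis, the sketch below fits a straight line through disk usage samples and estimates how many days remain before usage crosses a limit. The linear model, the 80% limit, and the function name are assumptions; real usage growth may not be linear.

```python
from typing import List, Optional, Tuple


def days_until_full(samples: List[Tuple[float, float]],
                    limit_pct: float = 80.0) -> Optional[float]:
    """Fit a least-squares line through (day, used_pct) samples and
    return the days remaining, after the last sample, until usage
    reaches `limit_pct`. Returns None if usage is flat or shrinking."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (limit_pct - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, crossing_day - last_day)
```

A remediation rule could then trigger a cluster resize when the estimate drops below some lead time, rather than waiting for a disk space alert.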