Resilience and Compliance at Speed and Scale
ISACA SV Spring Conference
Jason Chan
chan@netflix.com
linkedin.com/in/jasonbchan
@chanjbs
About Me
• Engineering Director @ Netflix:
  • Security: product, app, ops, IR, fraud/abuse
• Previously:
  • Led infosec team @ VMware
  • Consultant - @stake, iSEC Partners
About Netflix
Common Approaches to Resilience
Common Controls to Promote Resilience
• Architectural committees
• Change approval boards
• Centralized deployments
• Vendor-specific, component-level HA
• Standards and checklists
Architectural committees:
• Designed to standardize on design patterns, vendors, etc.
• Problems for Netflix:
  • Freedom and Responsibility Culture
  • Highly aligned and loosely coupled
  • Innovation cycles
Common Controls to Promote Resilience
• Architectural committees
• Change approval boards
• Centralized deployments
• Vendor-specific, component-level HA
• Standards and checklists
Change approval boards:
• Designed to control and de-risk change
• Focus on artifacts, test and rollback plans
• Problems for Netflix:
  • Freedom and Responsibility Culture
  • Highly aligned and loosely coupled
  • Innovation cycles
Common Controls to Promote Resilience
• Architectural committees
• Change approval boards
• Centralized deployments
• Vendor-specific, component-level HA
• Standards and checklists
Centralized deployments:
• Separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly)
• Problems for Netflix:
  • Freedom and Responsibility Culture
  • Highly aligned and loosely coupled
  • Innovation cycles
Common Controls to Promote Resilience
• Architectural committees
• Change approval boards
• Centralized deployments
• Vendor-specific, component-level HA
• Standards and checklists
Vendor-specific, component-level HA:
• High reliance on vendor solutions to provide HA and resilience
• Problems for Netflix:
  • Traditional data-center-oriented systems do not translate well to the cloud
  • Heavy use of open source
Common Controls to Promote Resilience
• Architectural committees
• Change approval boards
• Centralized deployments
• Vendor-specific, component-level HA
• Standards and checklists
Standards and checklists:
• Designed for repeatable execution
• Problems for Netflix:
  • Reliance on humans
Approaches to Resilience @ Netflix
What does the business value?
• Customer experience
• Innovation and agility
• In other words:
  • Stability and availability for customer experience
  • Rapid development and change to continually improve product and outpace competition
• Not that different from anyone else
Overall Approach
• Understand and solve for relevant failure modes
• Rely on automation and tools, not humans or committees
• Make no assumptions that planned controls will work
• Provide train tracks and guardrails, but invite deviation
Goals of Simian Army
“Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.”
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
Systems fail
Chaos Monkey
• “By frequently causing failures, we force our services to be built in a way that is more resilient.”
• Terminates cluster nodes during business hours
• Rejects “If it ain’t broke, don’t fix it”
• Goals:
  • Simulate random hardware failures, human error at small scale
  • Identify weaknesses
  • No service impact
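The selection step behind "terminates cluster nodes during business hours" might look roughly like the sketch below. It is an illustration only, not the Simian Army implementation: the 09:00-17:00 weekday window, the function names, and the instance IDs are all assumptions, and the actual termination call is left out.

```python
import random
from datetime import datetime

BUSINESS_HOURS = range(9, 17)  # assumption: 09:00-16:59, Monday to Friday


def in_business_hours(now):
    """True on a weekday inside the assumed business-hours window."""
    return now.weekday() < 5 and now.hour in BUSINESS_HOURS


def pick_victim(instances, now, rng):
    """Return one instance ID to terminate, or None if none should be."""
    if not instances or not in_business_hours(now):
        return None
    # Sort first so a seeded rng gives a repeatable choice in tests.
    return rng.choice(sorted(instances))


cluster = {"i-aaa", "i-bbb", "i-ccc"}  # hypothetical instance IDs
victim = pick_victim(cluster, datetime(2013, 5, 7, 10, 30), random.Random(0))
print("would terminate:", victim)
```

Restricting the window to business hours means engineers are at their desks when a weakness surfaces.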
Lots of systems fail
Chaos Gorilla
• Chaos Monkey’s bigger brother
• Standard deployment pattern is to distribute load/systems/data across three data centers (AZs)
• What happens if one is lost?
• Goals:
  • Simulate data center loss, hardware/service failures at larger scale
  • Identify weaknesses, dependencies, etc.
  • Minimal service impact
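The "what happens if one is lost?" question can be put as a capacity check: for each zone, would the instances in the other zones still cover the required load? A minimal sketch, assuming capacity is simply an instance count; the zone names and numbers are illustrative.

```python
from collections import Counter


def survives_zone_loss(instance_zones, required_capacity):
    """Map each zone to whether the remaining zones still meet required_capacity.

    instance_zones: list of zone names, one entry per running instance.
    """
    per_zone = Counter(instance_zones)
    total = sum(per_zone.values())
    return {zone: total - count >= required_capacity
            for zone, count in per_zone.items()}


# Illustrative fleet: 4 instances in each of three zones, 8 needed to serve load.
fleet = ["us-east-1a"] * 4 + ["us-east-1b"] * 4 + ["us-east-1c"] * 4
print(survives_zone_loss(fleet, required_capacity=8))
```

An even three-way split only passes if the fleet is over-provisioned by half, which is the cost of tolerating the loss of one zone.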
What about larger catastrophes?
Chaos Kong
• Simulate an entire region (US west coast, US east coast) failing
• For example: hurricane, large winter storm, earthquake, etc.
• Goals:
  • Exercise end-to-end large-scale failover (routing, DNS, scaling up)
The sick and wounded
Latency Monkey
• Distributed systems have many upstream/downstream connections
• How fault-tolerant are systems to dependency failure/slowdown?
• Goals:
  • Simulate latencies and error codes, see how a service responds
  • Survivable services regardless of dependencies
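One way to picture latency and error injection is a wrapper around a dependency call, paired with a caller that degrades gracefully. The sketch below is an assumption-laden illustration, not how Latency Monkey is built: `with_faults`, `get_recommendations`, and `real_fetch` are hypothetical names, and a real caller would catch its client library's timeout and error types rather than `InjectedFault`.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error."""


def with_faults(call, delay_s, error_rate, rng):
    """Wrap `call` so it sleeps delay_s, then fails with probability error_rate."""
    def wrapped(*args, **kwargs):
        time.sleep(delay_s)
        if rng.random() < error_rate:
            raise InjectedFault("simulated dependency failure")
        return call(*args, **kwargs)
    return wrapped


def get_recommendations(fetch, user_id, fallback):
    """Caller under test: return a fallback instead of failing outright."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return fallback


def real_fetch(user_id):
    """Stand-in for a downstream service call."""
    return [f"title-for-{user_id}"]


flaky = with_faults(real_fetch, delay_s=0.01, error_rate=1.0, rng=random.Random(0))
print(get_recommendations(flaky, 42, fallback=["popular"]))
```

With `error_rate=1.0` every call fails, so the caller has to serve the fallback list; that is the "survivable regardless of dependencies" behavior being exercised.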
Outliers and rebels
Conformity Monkey
• Without architecture review, how do you ensure designs leverage known successful patterns?
• Conformity Monkey provides automated analysis for pattern adherence
• Goals:
  • Evaluate deployment modes (data center distribution)
  • Evaluate health checks, discoverability, versions of key libraries
  • Help ensure service has best chance of successful operation
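Automated pattern checks of this kind can be thought of as a set of rules run against a description of each cluster. The sketch below assumes a plain dictionary as the cluster description; the field names, the three-zone minimum, and the library version threshold are hypothetical and chosen only to mirror the goals listed above.

```python
def check_cluster(cluster, min_zones=3, min_lib_version=(1, 2)):
    """Return a list of human-readable rule violations for one cluster."""
    violations = []
    if len(set(cluster.get("zones", []))) < min_zones:
        violations.append(f"runs in fewer than {min_zones} zones")
    if not cluster.get("health_check_url"):
        violations.append("no health check configured")
    if not cluster.get("registered_in_discovery", False):
        violations.append("not registered with service discovery")
    if tuple(cluster.get("platform_lib_version", (0, 0))) < min_lib_version:
        violations.append("platform library is older than the minimum version")
    return violations


# Hypothetical cluster description that breaks every rule.
legacy = {"zones": ["zone-a", "zone-a"], "platform_lib_version": (1, 0)}
for problem in check_cluster(legacy):
    print("legacy-cluster:", problem)
```

Reporting violations to the owning team, rather than blocking the deployment, fits the "guardrails, but invite deviation" approach.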
Cruft, junk, and clutter
Janitor Monkey
• Clutter accumulates, in the form of:
  • Complexity
  • Vulnerabilities
  • Cost
• Janitor identifies unused resources and reaps them to save money and reduce exposure
• Goals:
  • Automated hygiene
  • More freedom for engineers to innovate and move fast
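The "identify unused resources" step can be sketched as a filter over a resource inventory. The rule used here (unattached and idle for more than 30 days) and the record fields are assumptions for illustration; the sketch only lists candidates and leaves the actual reaping out.

```python
from datetime import datetime, timedelta


def find_unused(resources, now, max_idle=timedelta(days=30)):
    """Return IDs of resources that are unattached and idle longer than max_idle."""
    return [r["id"] for r in resources
            if not r["attached"] and now - r["last_used"] > max_idle]


now = datetime(2013, 5, 7)
inventory = [  # hypothetical inventory records
    {"id": "vol-1", "attached": False, "last_used": now - timedelta(days=90)},
    {"id": "vol-2", "attached": True, "last_used": now - timedelta(days=90)},
    {"id": "vol-3", "attached": False, "last_used": now - timedelta(days=2)},
]
print("candidates for cleanup:", find_unused(inventory, now))
```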
Non-Simian Approaches
• Org model
• Engineers write, deploy, support code
• Culture
• De-centralized with as few processes and rules as possible
• Lots of local autonomy
• “If you’re not failing, you’re not trying hard enough”
• Peer pressure
• Productive and transparent incident reviews
Software Deployment for Compliance-Sensitive Apps
Control Objectives for Software Deployments
Visibility and transparency
• Who did what, when?
• What was the scope of the change or deployment?
• Was it reviewed?
• Was it tested?
• Was it approved?
Typically attempted via:
• Restricted access/SoD
• CMDBs
• Change management processes
• Test results
• Change windows
Large and Dynamic Systems Need a Different Approach
• No operations organization
• No acceptable windows for downtime
• Thousands of deployments and changes per day
Control Objectives Haven’t Changed
Visibility and transparency
• Who did what, when?
• What was the scope of the change or deployment?
• Was it reviewed?
• Was it tested?
• Was it approved?
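If every change is captured automatically as a structured event, each of these questions becomes a field lookup rather than a manual audit. The record below is a hypothetical shape, not a Netflix schema; the field names are assumptions chosen to line up with the five questions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ChangeRecord:
    actor: str                          # who
    action: str                         # did what
    timestamp: datetime                 # when
    app: str                            # scope: application
    region: str                         # scope: region
    environment: str                    # scope: test, prod, ...
    commit: Optional[str] = None        # scope: SCM commit
    reviewed_by: Optional[str] = None   # was it reviewed?
    test_job: Optional[str] = None      # was it tested? (e.g. a CI job)
    approved_by: Optional[str] = None   # was it approved?
    tickets: List[str] = field(default_factory=list)


def open_questions(record):
    """List the control questions this record cannot yet answer."""
    checks = {
        "scope (commit)": record.commit,
        "reviewed": record.reviewed_by,
        "tested": record.test_job,
        "approved": record.approved_by,
    }
    return [name for name, value in checks.items() if not value]


change = ChangeRecord("alice", "deploy", datetime(2013, 5, 7, 10, 30),
                      "api", "us-east-1", "prod",
                      commit="abc123", test_job="api-build-42")
print(open_questions(change))
```

A record with unanswered questions can be flagged for follow-up without blocking the thousands of other changes made that day.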
System-wide view on changes
Access to changes by app, region, environment, etc.
Lookback in time as needed
Changes, via email
When?
By who?
What changed?
Integrated awareness
Chat integration lets engineers easily access info
Automated testing
1000+ tests to compare proposed vs. existing
Automated scoring and deployment decision
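A scoring step that compares proposed vs. existing could, in its simplest form, count how many metrics from the proposed build stay within a tolerance of the existing build and gate the rollout on that fraction. The sketch below assumes exactly that; the 10% tolerance, the 0.95 threshold, and the metric names are illustrative, not the scoring method the slides refer to.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Fraction of shared metrics where the canary is within tolerance of baseline."""
    shared = [m for m in baseline if m in canary]
    if not shared:
        return 0.0
    passed = 0
    for metric in shared:
        allowed = abs(baseline[metric]) * tolerance
        if abs(canary[metric] - baseline[metric]) <= allowed:
            passed += 1
    return passed / len(shared)


def decide(score, threshold=0.95):
    """Turn a score into a deployment decision."""
    return "deploy" if score >= threshold else "rollback"


existing = {"latency_ms": 100.0, "error_rate": 0.01}   # hypothetical metrics
proposed = {"latency_ms": 150.0, "error_rate": 0.01}
score = canary_score(existing, proposed)
print(score, decide(score))
```

Here latency regresses by 50%, so only one of two metrics passes and the decision is a rollback.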
Complete view of deployment lifecycle
Jenkins (CI) job
App name
Currently running clusters by region/environment
Cluster ID
Deployment details
AMI version
SCM commit
Modified files
Source diffs
Link to relevant JIRA(s)
Takeaway
• Control objectives have not changed, but advantages of new technologies and operational models dictate updated approaches
Netflix References
• http://netflix.github.com
• http://techblog.netflix.com
• http://slideshare.net/netflix
Questions?
chan@netflix.com

