A Cloud Gateway - A Large Scale Company’s First Line of Defense
Mikey Cohen
Manager - Edge Gateway
Netflix
Today, more than 36% of North America’s internet traffic is controlled by systems in the Amazon Cloud
Rethinking Cloud Proxies
Global Streaming of TV Shows and Movies
Nearly 70 Million Subscribers
In over 80 Countries
Netflix accounts for over 36% of Downstream Traffic in North America
From the Internet to Services in the Cloud
[Diagram: Gateway nodes in front of Origin (API), API, and Website services]
Our Edge Gateway @ Netflix
Handles most netflix.com hosts
Over 20 production Zuul clusters
~50 ELBs
Gateway handles ~10 origin services
Netflix Gateway Scale
Tens of billions of requests per day
3 AWS regions
Over 1000 device types
Hundreds of permutations of protocols and device versions
Success
Evolution
Scale
Failure
Our Journey
So What!? - Change your perspective!!
Traditional Cloud Proxy Mission
Simple static rule-based routing
API portal
Request authentication
Throttling - request caps
Monitoring
Caching
The Gateway - a grown-up proxy!
●Dynamic routing
●Deep Insights
●Load balancing
●Availability focused
●Service protection
●Quality assurance tool
Evolving to a Gateway
Netflix’s Public API
Late 2008
Mashery
Datacenter
Streaming Devices using public API
Early Streaming Devices - 2009
Windows Media Center
XBox
PS3
Migration to AWS
2010
Sonoa / Apigee proxy
Device traffic, not public
Controlling DC -> cloud migration
Running in AWS
Under Netflix control
Streaming Success
2011
Chaos
Complexity
Failure
Success
Leveraging Cloud benefits
Anti-patterns of most cloud proxies
Static configurations
Service push needed to change behavior
Limited range of functionality
Limited to HTTP
Zuul Created
2012
Dynamically injected and compiled filters
Manipulate requests and responses
Headers / Body / etc
Change routing
Add metrics and other functions
Built on Netflix’s OSS stack
Open Sourced
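The idea of filters that can be injected at runtime to manipulate requests and change routing can be sketched as below. This is a minimal illustration in Python, not Zuul's actual API: the `Request` and `FilterChain` classes, the filter names, and the header name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    route: str = "default"


# A filter is any callable that mutates the request in place.
Filter = Callable[[Request], None]


class FilterChain:
    """Ordered filters that can be added or removed while the gateway runs."""

    def __init__(self) -> None:
        self._filters: List[Tuple[int, str, Filter]] = []  # (order, name, fn)

    def register(self, name: str, order: int, fn: Filter) -> None:
        # Replace any filter with the same name, then keep the chain sorted.
        self._filters = [f for f in self._filters if f[1] != name]
        self._filters.append((order, name, fn))
        self._filters.sort(key=lambda f: f[0])

    def unregister(self, name: str) -> None:
        self._filters = [f for f in self._filters if f[1] != name]

    def run(self, request: Request) -> Request:
        for _, _, fn in self._filters:
            fn(request)
        return request


def add_trace_header(request: Request) -> None:
    # Hypothetical header used only for this illustration.
    request.headers["X-Trace"] = "on"


def route_test_paths(request: Request) -> None:
    if request.path.startswith("/test/"):
        request.route = "test"


chain = FilterChain()
chain.register("trace", 10, add_trace_header)
chain.register("test-route", 20, route_test_paths)
result = chain.run(Request(path="/test/titles"))
```

Because filters are registered by name, behavior can change without a service push: calling `chain.unregister("test-route")` makes later requests keep the default route.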
Zuul - A Victim of Success
Easy and convenient
Instant results
High adoption
Happy customers
Business logic in proxy
Affects system resiliency
Zuul team in critical path
Creating a Gateway Strategy
Principles of Netflix’s Gateway Strategy
Creative Routing
Dynamic Routing
Delivery Focused
Traffic Shaping
React Fast
Insights
Creative Routing - Subclusters with Purpose
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
Red / Green Deployments
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
Developer Test Branches
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
Instrumented Clusters
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
Squeeze Testing
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
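Squeeze testing, as described in the speaker notes, has the gateway gradually increase traffic to a cluster until performance degradation is detected. A minimal sketch of that loop follows. The latency-based degradation check, the step sizes, and the `fake_latency` server model are assumptions made for illustration, not details from the talk.

```python
from typing import Callable


def squeeze_test(
    measure_latency_ms: Callable[[int], float],
    start_rps: int,
    step_rps: int,
    max_rps: int,
    latency_limit_ms: float,
) -> int:
    """Return the highest request rate that stayed within the latency limit.

    Returns 0 if even the starting rate breaches the limit.
    """
    last_good = 0
    rps = start_rps
    while rps <= max_rps:
        if measure_latency_ms(rps) > latency_limit_ms:
            break  # degradation detected: stop increasing traffic
        last_good = rps
        rps += step_rps
    return last_good


def fake_latency(rps: int) -> float:
    # Hypothetical server: flat 50 ms up to 300 rps, then latency climbs.
    return 50.0 if rps <= 300 else 50.0 + (rps - 300) * 2.0


capacity = squeeze_test(
    fake_latency, start_rps=100, step_rps=100, max_rps=1000, latency_limit_ms=200.0
)
print(capacity)  # 300 for the fake server above
```

In a real gateway, `measure_latency_ms` would be replaced by live metrics from the squeeze cluster, and the ramp could be driven automatically or by hand.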
Targeted Routing
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
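Targeted routing isolates requests by customer, route, device type, or any other rule and sends them to a purpose-built cluster such as a debug node. One way to sketch this is with first-match rules. The `RoutingRule` fields, request keys, and cluster names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RoutingRule:
    cluster: str
    customer_id: Optional[str] = None
    device_type: Optional[str] = None
    path_prefix: Optional[str] = None

    def matches(self, request: Dict[str, str]) -> bool:
        # A field left as None matches anything.
        if self.customer_id is not None and request.get("customer_id") != self.customer_id:
            return False
        if self.device_type is not None and request.get("device_type") != self.device_type:
            return False
        if self.path_prefix is not None and not request.get("path", "").startswith(self.path_prefix):
            return False
        return True


def choose_cluster(
    request: Dict[str, str], rules: List[RoutingRule], default: str = "v1"
) -> str:
    """Return the cluster of the first matching rule, or the default cluster."""
    for rule in rules:
        if rule.matches(request):
            return rule.cluster
    return default


rules = [
    RoutingRule(cluster="debug", customer_id="c-42"),
    RoutingRule(cluster="test", device_type="ps3", path_prefix="/api/"),
]
debug_target = choose_cluster({"customer_id": "c-42", "path": "/api/home"}, rules)
normal_target = choose_cluster({"customer_id": "c-7", "path": "/api/home"}, rules)
```

Rule order matters in this sketch: a request matching several rules goes to the cluster of the first one listed.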
Service “Canarying”
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
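The speaker notes describe canary analysis as comparing metrics such as error rate and CPU between a baseline and a canary cluster under equal traffic, producing a score, and stopping the push on a poor score. The scoring rule below is an assumption chosen for illustration (the fraction of metrics within a relative tolerance); the talk does not specify how the real score is computed.

```python
from typing import Dict


def canary_score(
    baseline: Dict[str, float], canary: Dict[str, float], tolerance: float = 0.10
) -> float:
    """Fraction of metrics where the canary is at most `tolerance` (relative)
    worse than the baseline. Lower is better for every metric in this sketch."""
    if not baseline:
        raise ValueError("at least one metric is required")
    passed = 0
    for name, base_value in baseline.items():
        allowed = base_value * (1.0 + tolerance)
        if canary[name] <= allowed:
            passed += 1
    return passed / len(baseline)


def should_continue_push(score: float, threshold: float = 1.0) -> bool:
    # A score below the threshold stops the push process.
    return score >= threshold


# Hypothetical metric names and values.
baseline = {"error_rate": 0.010, "cpu_ms_per_request": 4.0}
good_canary = {"error_rate": 0.0105, "cpu_ms_per_request": 4.2}
bad_canary = {"error_rate": 0.020, "cpu_ms_per_request": 4.1}

good_score = canary_score(baseline, good_canary)
bad_score = canary_score(baseline, bad_canary)
```

With the default threshold, `should_continue_push(good_score)` allows the push and `should_continue_push(bad_score)` stops it, since the bad canary's error rate is double the baseline.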
“Sticky” Canary
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
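The sticky canary idea pins all requests from a small subset of customers, for a limited time, to a “sticky canary” or “sticky baseline” cluster. A minimal sketch under stated assumptions: customers are selected by hashing their id into 10,000 buckets, and the cluster names, bucket counts, and time window parameters are all hypothetical.

```python
import hashlib


def bucket(customer_id: str, buckets: int = 10_000) -> int:
    """Stable bucket for a customer: the same id always lands in the same bucket."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def sticky_cluster(
    customer_id: str,
    now: float,
    start: float,
    duration_s: float,
    sample_buckets: int = 10,
) -> str:
    """Pick a cluster for this customer.

    Outside the experiment window everyone gets the main cluster. Inside it,
    `sample_buckets` buckets out of 10,000 stick to the canary and the next
    `sample_buckets` buckets stick to the baseline.
    """
    if not (start <= now < start + duration_s):
        return "main"
    b = bucket(customer_id)
    if b < sample_buckets:
        return "sticky-canary"
    if b < 2 * sample_buckets:
        return "sticky-baseline"
    return "main"


# The same customer gets the same answer for the whole window.
first = sticky_cluster("customer-123", now=100.0, start=0.0, duration_s=3600.0)
second = sticky_cluster("customer-123", now=200.0, start=0.0, duration_s=3600.0)
```

Hashing keeps the assignment stable without storing per-customer state, and the time window bounds how long any customer can be affected.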
Failure Injection Testing
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
Degraded Experience Testing
[Diagram: Gateway nodes routing to Origin (API) subclusters: v1, v2, test, debug, canary, baseline, “sticky” canary, “sticky” baseline, FIT, instrumented, squeeze]
Traffic Shaping
A Global Cloud Deployment
[Diagram: three regions, each with Network, Presentation, Business services, and Persistence tiers, showing Proxy, Websites, API, and DB components]
Global Cloud Routing
[Diagram: three regions, each with Network, Presentation, Business services, and Persistence tiers, showing Proxy, Websites, API, and DB components]
A Failing Region
[Diagram: three regions, each with Network, Presentation, Business services, and Persistence tiers, showing Proxy, Websites, API, and DB components]
Gateway routing to other regions
[Diagram: three regions, each with Network, Presentation, Business services, and Persistence tiers, showing Proxy, Websites, API, and DB components]
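When a region fails, the gateway reroutes traffic to another region. A minimal sketch of that decision, assuming each client has an ordered list of preferred regions and the gateway has a current health map. The region names and the source of the health data are hypothetical.

```python
from typing import Dict, List


def pick_region(preferred: List[str], healthy: Dict[str, bool]) -> str:
    """Return the first healthy region in the client's preference order."""
    for region in preferred:
        # Regions missing from the health map are treated as unhealthy.
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


preference = ["region-a", "region-b", "region-c"]  # hypothetical region names
normal = pick_region(preference, {"region-a": True, "region-b": True, "region-c": True})
failover = pick_region(preference, {"region-a": False, "region-b": True, "region-c": True})
```

Keeping the preference list per client also covers the non-failure case in the notes, where a client is rerouted simply because another region is closer.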
Attack prevention
[Diagram: Gateway nodes in front of Origin (API), API, and Website services]
Smart Load Balancing
[Diagram: three Gateway nodes in front of Origin (API)]
Smart Load Balancing - Bad Nodes
[Diagram: three Gateway nodes in front of Origin (API)]
Gateway Backs Off and Blacklists Bad Nodes
[Diagram: three Gateway nodes in front of Origin (API)]
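Backing off from bad nodes can be sketched as a round-robin balancer that takes a node out of rotation after repeated failures and lets it back in after a cooldown. The failure threshold, the cooldown length, and the class itself are assumptions made for illustration, not the gateway's actual load-balancing logic.

```python
from typing import Dict, List


class NodeBalancer:
    """Round-robin over nodes, skipping nodes that failed recently."""

    def __init__(
        self, nodes: List[str], max_failures: int = 3, cooldown_s: float = 30.0
    ) -> None:
        self.nodes = list(nodes)
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures: Dict[str, int] = {n: 0 for n in self.nodes}
        self.blacklisted_until: Dict[str, float] = {}
        self._next = 0

    def record_failure(self, node: str, now: float) -> None:
        self.failures[node] += 1
        if self.failures[node] >= self.max_failures:
            # Back off: take the node out of rotation for the cooldown period.
            self.blacklisted_until[node] = now + self.cooldown_s
            self.failures[node] = 0

    def record_success(self, node: str) -> None:
        self.failures[node] = 0

    def is_available(self, node: str, now: float) -> bool:
        return now >= self.blacklisted_until.get(node, 0.0)

    def choose(self, now: float) -> str:
        # Try each node at most once per call, starting from the cursor.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if self.is_available(node, now):
                return node
        raise RuntimeError("all nodes are blacklisted")


lb = NodeBalancer(["a", "b", "c"], max_failures=2, cooldown_s=10.0)
lb.record_failure("b", now=0.0)
lb.record_failure("b", now=0.0)
picks = [lb.choose(now=1.0) for _ in range(4)]
```

The same bookkeeping could be keyed by availability zone instead of node to blacklist a whole zone at once.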
Zone Failure - Blacklist the Zone automatically
[Diagram: three Gateway nodes in front of Origin (API)]
React Quickly - Runtime Filter changes
[Diagram: Runtime Policy Injection into Gateway nodes in front of Origin (API), API, and Website services]
A Room with a View - Insights
[Diagram: Gateway nodes in front of Origin (API), API, and Website services, labeled Insights]
What’s Next for Netflix’s Gateway?
Gateway as a service
Self-service dynamic routing / route validation
Control APIs for special routing functions
Netty Based Zuul (using RxNetty)
Handling persistent connections
non-blocking, async
Transport protocol agnostic routing
Reactive Socket http://reactivesocket.io/
Top Ten Lessons Learned
Build for handling Failures
Expect the Unexpected
Using Routing Creatively
Shard to Reduce Blast Radius
Devices are Weird
Protocols are Weird
Devices are Forever
Protocols are Forever
It will be built “wrong”
Keep Business Logic out of your Gateway
For More Info...
Zuul OSS
Netflix Tech Blog
RxNetty
Jobs

Editor's Notes

  • #12: Our gateway strategy will change the way you think about resiliency, debugging, continuous delivery, service operations, and insights.
  • #19: Devices are slow to update. Need emergency policies. Fast action.
  • #20: Limited range of functionality. Hard to program. Authentication. Authorization. Static responses / origin-specific headers. Why? Federation of logic across systems creates complexity. Minimize gateway dependencies to maximize availability.
  • #24: Origin services run many clusters. Route to service clusters based on dynamic routing rules. Shape or reject traffic based on service health, regional health, or attack. React fast in emergencies. Real-time analytics and insights. The gateway ensures request delivery from the internet to services running in the cloud; dynamically changes routing behaviors; routes to services (services have multiple clusters, and clusters have dynamically changing nodes); bridges multiple cloud regions and data centers; and provides system insights.
  • #25: Same service, with subclusters for many purposes. Set up by filters in Zuul. Self-serviceable by cluster owners. Automated. Uses: quality assurance / test automation; targeted debugging; canary / baseline (A/B testing of service behavior per build); squeeze testing (service capacity testing); trickle traffic (instrumented builds); sticky canary (A/B testing of client behavior per origin build).
  • #28: Trickling traffic into clusters. High-overhead profiling tools. “Coalmine” verbose logging.
  • #29: Server capacity testing. The gateway gradually increases traffic until performance degradation is detected. Automated or manual.
  • #30: Isolate requests by customer, route, type of device, or any routing rule. Debug node(s) are often instrumented to give verbose logging. Custom request routing.
  • #31: Compare server behavior and metrics. Equal traffic rates hit both clusters. Automated part of the production push process. Metrics compared: error rates, CPU for equivalent work. Automated metrics analysis returns a score of how well the canary cluster performed. A poor score stops the push process.
  • #32: Servers may be healthy while the data is bad: API changes that affect devices; data changes certain devices can’t interpret; protocol and transport changes that some devices can’t accept. Testing thousands of device types would be a time-consuming, tedious process. Sticky canary idea: stick all requests for a small subset of customers, for a limited time, to a “sticky canary” or “sticky baseline”. If servers are equivalent, there should be no behavioral differences. Insights can help find these anomalies. Limited scope of impact: a very small subset of customers could be affected, but only for a short period of time.
  • #37: Reroute to the region closer to the client (DNS accuracy issues, etc.). Reroute due to region failure.
  • #40: Speedbump. Dynamic DDoS prevention.