PRESENTED BY
Creating a Highly Available Persistent
Session Management Service with Redis
and a Connection Pooling Proxy
Mohamed Elmergawi
Lead Software Engineer, Zulily
A NEW STORE EVERY DAY
Thousands of products at brag-worthy prices
INSPIRED, DISCOVERY-DRIVEN EXPERIENCE
without specific purchase intent
HIGHLY CURATED SALES EVENTS
100+ time-limited sales (72 hours)
A DAILY DESTINATION
75% of orders via mobile (Q3 2019)
MASSIVELY PERSONALIZED APPROACH
Launch millions of versions of the site/app
daily
GLOBAL MARKETPLACE
15,000+ vendors including Under Armour,
Cuisinart, Melissa & Doug
ZULILY’S BUSINESS CREATES INTERESTING
TECHNICAL CHALLENGES
A reliable global session service is critical:
• If it goes down, you can't serve customers
• Infrastructure is volatile; we need persistence
• Speed is key
“Everything fails all the time” - Werner Vogels, CTO Amazon
Problem Definition
• No HA: hardware or network degradation leads to a failure
• Sharding logic is coupled to the application layer
• Requires manual intervention to promote a slave to master
• Limits global expansion
• Idle slave nodes
Legacy Architecture
[Diagram: legacy architecture. Labels: APP CLUSTER and SITE CLUSTER, each with TWEMPROXY and R/W paths; REST API APPLICATION CLUSTER with TWEMPROXY; master nodes, each with a slave node as an async replica.]
Redis Cluster
• Not suited for applications that require availability in the event of large network splits
• Active/passive mode
Redis Sentinel
• The sharding logic would still be coupled to the application
• Active/passive mode
Alternative Approaches
New Architecture
• Connection Pooling Proxy
• Session Service
• Real-time replication
[Diagram: new architecture. Labels: SITE CLUSTER and APP CLUSTER, each with a PROXY; ALB; Session service 1, 2, ... n; Redis + Dynomite nodes in Region a, Region b, and Region c.]
• Reduces the overhead associated with establishing a new connection
• Leverages existing connections efficiently
• Constrains the total number of connections
Connection Pooling Proxy for Site and App Cluster Nodes
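The three bullets above can be illustrated with a small pool. This is a minimal sketch, not Zulily's proxy: the class name `ConnectionPool`, the `factory` callable, and the `max_size` parameter are all hypothetical, and a real proxy would also handle broken connections and metrics. It shows connections being opened lazily, reused in preference to opening new ones, and capped at a fixed total.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: reuse idle connections, never exceed max_size."""

    def __init__(self, factory, max_size):
        self._factory = factory            # callable that opens one new connection
        self._max_size = max_size          # hard cap on connections ever opened
        self._idle = queue.LifoQueue()     # connections waiting to be reused
        self._lock = threading.Lock()
        self.opened = 0                    # how many connections were created

    def _acquire(self, timeout):
        try:
            return self._idle.get_nowait()     # cheapest path: reuse
        except queue.Empty:
            pass
        with self._lock:
            can_open = self.opened < self._max_size
            if can_open:
                self.opened += 1               # reserve a slot before opening
        if can_open:
            try:
                return self._factory()
            except Exception:
                with self._lock:
                    self.opened -= 1           # give the slot back on failure
                raise
        # At the cap: wait for another caller to release a connection.
        return self._idle.get(timeout=timeout)  # raises queue.Empty on timeout

    @contextmanager
    def connection(self, timeout=1.0):
        conn = self._acquire(timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)               # hand it back for reuse
```

A caller would wrap each request in `with pool.connection() as conn:`; once `max_size` connections exist, further callers wait for a release instead of opening more.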
• Request routing based on consistent hashing (Murmur hash)
• Traffic distribution based on geolocation
• Topology-aware load balancing (token aware)
• Request rerouting based on failed functional or latency health checks
Session Service
[Diagram: token ring with nodes a1, a2, and a3 and token ranges 0 - 100, 101 - 200, and 201 - 300.]
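Token-aware routing over a ring like the one on this slide can be sketched as follows. Everything here is an assumption for illustration: the `TokenRing` class is hypothetical, the pure-Python 32-bit MurmurHash3 may not be the Murmur variant used in production, the mapping of a1, a2, a3 to the three ranges follows the order in which the slide lists them, and reducing the hash modulo 301 only serves to fit the slide's toy token space of 0 - 300.

```python
from bisect import bisect_left


def _rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant) of data."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_full = len(data) - (len(data) % 4)
    for i in range(0, n_full, 4):              # body: 4-byte little-endian blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[n_full:]                       # remaining 0 to 3 bytes
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)                             # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


class TokenRing:
    """Hypothetical ring: each node owns tokens up to and including its upper bound."""

    def __init__(self, upper_bounds):
        pairs = sorted((hi, node) for node, hi in upper_bounds.items())
        self._his = [hi for hi, _ in pairs]
        self._nodes = [node for _, node in pairs]
        self._size = self._his[-1] + 1         # toy token space: 0 .. highest bound

    def token_for(self, session_id: str) -> int:
        return murmur3_32(session_id.encode("utf-8")) % self._size

    def owner_of(self, token: int) -> str:
        return self._nodes[bisect_left(self._his, token)]

    def route(self, session_id: str) -> str:
        return self.owner_of(self.token_for(session_id))


# Assumed assignment of the slide's ranges to nodes.
RING_A = TokenRing({"a1": 100, "a2": 200, "a3": 300})
```

Because the owner depends only on the session id's token, every service instance routes a given session to the same node without coordinating with the others.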
Session Service
Real-Time Replication between Redis Nodes via Dynomite
P2P and active/active approach
[Diagram: Data Center a (a1, a2, a3), Data Center b (b1, b2, b3), and Data Center c (c1, c2, c3). Labels: "session id 1 hash", "session id 2 hash", "Incoming write, with persistent hashing", "Replication".]
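The active/active layout and the health-check rerouting mentioned for the session service can be modelled in memory. This is a deliberately simplified sketch under stated assumptions: `ReplicatedSessions` and its methods are hypothetical, plain dictionaries stand in for Redis nodes, replication is done inline rather than asynchronously as a real replication layer would, and a node that is down simply misses writes with no repair afterwards.

```python
DATACENTERS = ("a", "b", "c")   # data center labels from the slide
SLOTS = (1, 2, 3)               # ring positions: a1..a3, b1..b3, c1..c3


class ReplicatedSessions:
    """Toy model: every data center holds a full copy of the ring."""

    def __init__(self):
        self.stores = {f"{dc}{slot}": {} for dc in DATACENTERS for slot in SLOTS}
        self.down = set()       # nodes currently failing their health checks

    def healthy(self, node):
        return node not in self.down

    def pick_node(self, slot, local_dc):
        # Prefer the caller's own data center, then fall back to the peers
        # that own the same ring position elsewhere.
        order = [local_dc] + [dc for dc in DATACENTERS if dc != local_dc]
        for dc in order:
            node = f"{dc}{slot}"
            if self.healthy(node):
                return node
        raise RuntimeError(f"no healthy replica for slot {slot}")

    def write(self, slot, local_dc, session_id, data):
        node = self.pick_node(slot, local_dc)
        self.stores[node][session_id] = data
        # Active/active: copy the write to the same slot in the other data centers.
        for dc in DATACENTERS:
            peer = f"{dc}{slot}"
            if peer != node and self.healthy(peer):
                self.stores[peer][session_id] = data
        return node

    def read(self, slot, local_dc, session_id):
        node = self.pick_node(slot, local_dc)
        return node, self.stores[node].get(session_id)
```

In this model a session written through data center a stays readable after a2 and b2 are both marked down, because the read is rerouted to c2.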
• Staged rollout
• Double write (Time T1)
• Copied data offline from the slave nodes (Prior to T1)
• Double read
• Data sanity checks
• Apply chaos engineering principles to the new system
Production Rollout
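The double-write, offline copy, and double-read steps listed above can be sketched like this. It is an illustration under assumptions, not the rollout code: `MigratingSessionStore` and `backfill` are hypothetical names, dictionaries stand in for the legacy and new stores, and the legacy store is assumed to remain the source of truth until the sanity checks pass.

```python
class MigratingSessionStore:
    """Hypothetical wrapper used while both stores run side by side."""

    def __init__(self, legacy, new):
        self.legacy = legacy
        self.new = new
        self.mismatches = []    # session ids where the two stores disagreed

    def put(self, session_id, data):
        # Double write: from time T1 on, every write lands in both stores.
        self.legacy[session_id] = data
        self.new[session_id] = data

    def get(self, session_id):
        # Double read: serve from the legacy store, compare with the new one.
        expected = self.legacy.get(session_id)
        candidate = self.new.get(session_id)
        if candidate != expected:
            self.mismatches.append(session_id)   # input for data sanity checks
        return expected


def backfill(legacy_snapshot, new):
    """Copy sessions written before T1 without overwriting newer double writes."""
    for session_id, data in legacy_snapshot.items():
        new.setdefault(session_id, data)
```

Once reads stop producing mismatches, the new store holds everything the legacy store does and traffic can be shifted in stages.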
Results
After simulating an outage on 2 out of 3 network partitions:
• Recovery window: 250 ms
• Failure rate: 0.42%
Simulated Outage
• Scale can only happen in multiple hosts
• Higher network traffic volume
• Cross-AZ/Regions/DC traffic costs money
• Adding hosts to the ring is a manual process
Drawbacks
• Connection Pooling Proxy
• Session Service
• Redis is not only a cache; it is also persistent storage
• Design for failure
• Use chaos engineering practices
• Replicate your data across multiple regions and use real-time replication
Summary
Thank You!
zulily.com/careers

Highly Available Persistent Session Management Service by Mohamed Elmergawi of Zulily - Redis Day Seattle 2020


Editor's Notes

  • #2: I am the lead engineer for the e-commerce platform team at Zulily. I will talk about how we at Zulily used Redis to build a highly available, persistent session management service, how Zulily's business model created its specific technical challenges, and the role session management plays.
  • #3: Zulily's business model is all about a discovery-driven experience. Our customers come to the site and apps to discover and enjoy, like going to a mall or a boutique. Zulily launches a new store every day, which technically means launching millions of personalized versions of the site/app daily. That translates to specific technical challenges: - Traffic is spiky, which means taking time to warm the cache is not an option. - Speed is critical. - The customer session flow is critical for smooth discovery, and the session is called on every single request.
  • #4: A reliable global session service is critical: if it goes down, you can't serve customers, as every single request to the apps or site requires a session. Infrastructure is volatile; we need persistence. Speed is critical. As engineers, the main fact we believe in is "Everything fails, all the time": bad code pushes, hardware failure, network latency, region/AZ outages. That brings us to the reason we're here today: to discuss how we at Zulily evolved our infrastructure into a more distributed system with the help of Redis, to create a more reliable experience. In retail, a session service is critical, especially if your footprint is global. But we all know this familiar quote from Werner Vogels. Failure is bound to happen; our job as engineers is to plan for failure and to ensure that, no matter what, we can serve the customer. With the session management service sharded across multiple AZs, one AZ outage affects a percentage of customers: business impact == $$$.
  • #5: Typical architecture. Client layer: the apps and site clusters. Twemproxy: played the role of proxy and connection pooler and was deployed on every client machine. Application layer: the sharding logic was coupled with the application layer, and the session service shared resources with other application resources. Customer sessions lived in Redis as permanent storage, with slave nodes as backups. Problems: 1. Not HA: losing hardware or a network partition leads to an outage, and network latency leads to a degraded experience. 2. Sharding is coupled, which limited scaling and the global expansion of Zulily. We want our data close to our customers. Losing an AWS AZ caused us a major outage and a degraded experience. As session data is used for every request to the Zulily app, this was not acceptable.
  • #7: Client: replaced Twemproxy with a custom proxy that acts as a TCP connection pool, as clients no longer connect directly to Redis. Server: used consistent hashing and moved the sharding logic and geolocation detection into a new service that scales horizontally. Storage: used Redis as the storage layer, distributed across multiple regions in a ring topology for consistent hashing, and used Dynomite for replication across regions/data centers. Now I will deep dive into every layer: the client, server, and data layers. What did we need? Highly available, geo-distributed, and scalable. Tolerate hardware/partition failures and network degradation. Seamless customer experience. 1,000,000s of requests.
  • #8: Connection pooling on every node in the app and site clusters. Overhead of establishing a new TCP connection. Collecting metrics (service mesh / Envoy proxy). Leverages existing connections. Constrains the total number of open connections against the load balancer.
  • #10: We got rid of the master/slave approach and went P2P, using Dynomite, a Netflix open source project, for replication across regions. A data center here is just a virtual grouping: regions, AZs, or even on-premises. Read request life cycle: consistent hashing by the service layer. The service layer routes to the node in the ring that has the data (A1, A2, or A3).
  • #12: [FA] the actual graph isn't important other than to show latency remained flat, maybe add vertical lines to show when the network was killed... 