“TOO BIG TO FAILOVER”
A cautionary tale of scaling Redis
Aaron Pollack - May 2017
Presentation Summary
2
● How Redis is used at Napster
● Problems with failover at scale
● Our solution for constant time failovers
Napster is still around?
● Rhapsody rebranded as Napster last
Spring
● Provides on-demand and radio streaming
for mobile and desktop apps
● Powers on-demand streaming for apps like
iHeartRadio
The cat is back!
5
API.NAPSTER.COM
+
NAPSTER API SNAPSHOT
● API Gateway Layer
● 1k developers using the API
● 70m requests/day
● 7k Redis ops/sec
We LOVE Redis (mostly)
● Fast! - Response times <10ms to Redis cluster
with network round trip included.
● Simple - Built-in data types translate easily into
JS. Replication comes free.
● Available - Redis is mission critical for us. When
it’s down, we’re down.
Architected for Speed
8
So What’s The Problem?
9
So What’s The Problem?
10
● Redis server and sentinel share the same host
So What’s The Problem?
11
● Redis server and sentinel share the same host
● Four sentinels
a. An even number means that there is a chance for ties if
quorum is 2
So What’s The Problem?
12
● Redis server and sentinel share the same host
● Four sentinels
a. An even number means that there is a chance for ties if
quorum is 2
● Sending all read traffic to slaves means that you have downtime
during failover
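The sentinel setup above can be sketched as a config fragment. Running an odd number of sentinels (3 or 5) on hosts separate from the Redis servers avoids both tie votes and losing a sentinel together with its server (the master name, address, and values below are illustrative, not our production config):

```
# sentinel.conf — illustrative values
sentinel monitor mymaster 10.0.0.1 6379 2       # quorum of 2; run 3 or 5 sentinels
sentinel down-after-milliseconds mymaster 30000 # how long before a master is considered down
sentinel failover-timeout mymaster 180000
```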
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated
3. A new slave is elected master
4. New master does full BGSAVE
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover
13
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover (1GB in Memory)
14
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover (1GB in Memory)
15
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover (1GB in Memory)
16
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory (8 seconds)
7. Slave serves traffic
Steps in Failover (1GB in Memory)
17
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory (8 seconds)
7. Slave serves traffic
Steps in Failover (1GB in Memory)
18
Total Time: ~1.5 minutes
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (40 seconds)
5. Master syncs data to existing slaves (122 seconds)
6. Data is loaded into memory (43 seconds)
7. Slave serves traffic
Steps in Failover (5GB in Memory)
19
Total Time: ~4 minutes
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (181 seconds)
5. Master syncs data to existing slaves (305 seconds)
6. Data is loaded into memory (238 seconds)
7. Slave serves traffic
Steps in Failover (20GB in Memory)
20
Total Time: ~12.5 minutes
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (243 seconds)
5. Master syncs data to existing slaves (425 seconds)
6. Data is loaded into memory (354 seconds)
7. Slave serves traffic
Steps in Failover (40GB in Memory)
21
Total Time: ~18 minutes
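The totals above grow roughly linearly with dataset size. As a back-of-the-envelope model, with per-GB costs eyeballed from the measurements on these slides (an illustration of the trend, not a guarantee for any particular deployment):

```javascript
// Rough failover-time model fitted by eye to the numbers above.
const SENTINEL_DETECT_S = 30; // down-after-milliseconds default
const PER_GB_S = { bgsave: 7, sync: 11, load: 8 }; // seconds per GB, approximate

function estimateFailoverSeconds(gb) {
  const perGb = PER_GB_S.bgsave + PER_GB_S.sync + PER_GB_S.load;
  return SENTINEL_DETECT_S + gb * perGb;
}

console.log(estimateFailoverSeconds(40)); // 1070 s — about 18 minutes
```

The fixed 30-second detection window dominates for small datasets; past a few GB, the per-GB save/sync/load costs take over.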
RedisConf17 - Too Big to Failover - A cautionary tale of scaling Redis
23
Slaves Become Unreachable During Failover
1. What is causing the failover?
2. Why is the data growing so quickly?
Investigation
24
1. Out of memory
1. What’s causing the failover?
25
1. Out of memory
2. Saturated client connections
1. What’s causing the failover?
26
1. Out of memory
2. Saturated client connections
3. Gremlins
1. What’s causing the failover?
27
1. Can you control the growth
of data?
2. If you can’t control it, at least
monitor it!
3. Think about data in terms of
volatile vs non-volatile
2. Why is the data growing so quickly?
28
1. Connection Pooling!
a. https://github.com/luin/ioredis
2. Fast fail if connection is not ready
3. Backoff strategy for retry
3. How can we be better clients of Redis?
29
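The backoff idea in point 3 can be sketched as a `retryStrategy` function. ioredis calls it with the attempt count and expects a delay in milliseconds, or a non-number to stop retrying; the exact curve and cutoff below are our choice, not a library default:

```javascript
// Capped exponential backoff between reconnect attempts.
// Returning null tells ioredis to stop retrying, so callers fail fast
// instead of queueing commands against a dead server.
function retryStrategy(times) {
  if (times > 10) return null;             // give up after 10 attempts
  return Math.min(50 * 2 ** times, 2000);  // 100 ms, 200 ms, ... capped at 2 s
}
```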
ioredis
30
https://github.com/luin/ioredis
https://www.npmjs.com/package/ioredis
Client Singleton
31
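The code on this slide is an image, so here is a sketch of the idea: a module-scoped singleton, so every part of the app shares one connection. The factory is injected to keep the sketch runnable without a server; in our app it would be `() => new Redis(options)` from ioredis:

```javascript
// One shared client per process: the first call creates it, every later
// call returns the same instance (Node's require cache makes this global).
function makeGetClient(factory) {
  let client = null;
  return function getClient() {
    if (client === null) client = factory();
    return client;
  };
}

// Illustration with a stub factory standing in for `new Redis(...)`:
const getClient = makeGetClient(() => ({ createdAt: Date.now() }));
console.log(getClient() === getClient()); // true — a single shared instance
```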
Tuning ioredis Config
32
1. keepAlive - 0 (the default) starts TCP keep-alive on the Redis connection
with no initial delay
2. connectTimeout - milliseconds to wait before timing out the initial
connection to the Redis server
3. enableReadyCheck - wait for the server to finish loading the database from
disk before sending commands
4. retryStrategy - wait an increasing amount of time between connection
attempts
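Put together as an options object (the values reflect the spirit of what worked for us, not universal defaults — tune them for your own network):

```javascript
// ioredis connection options exercising the four knobs above.
const redisOptions = {
  keepAlive: 0,            // start TCP keep-alive probes with no initial delay
  connectTimeout: 5000,    // abort the initial connect after 5 s
  enableReadyCheck: true,  // wait until the dataset is loaded before sending commands
  retryStrategy: (times) => Math.min(times * 50, 2000), // linear backoff, 2 s cap
};

// new Redis(redisOptions) would consume these; shown standalone here.
console.log(redisOptions.retryStrategy(100)); // 2000
```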
1. Volatile vs non-volatile
a. Are you setting a TTL on keys?
2. What data is accessed the most?
4. Build your Redis env around your data
33
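Point 1a in practice: give every ephemeral key a TTL at write time so volatile data expires on its own, and persist durable data without one. A sketch (`client` is assumed to be an ioredis instance; the key names and TTL are illustrative):

```javascript
// Volatile data (e.g. OAuth token sets): always written with an expiry.
function storeToken(client, tokenId, payload) {
  return client.set(`token:${tokenId}`, JSON.stringify(payload), 'EX', 3600);
}

// Non-volatile data (e.g. developer records): persisted without a TTL,
// removed only deliberately.
function storeDeveloper(client, devId, profile) {
  return client.set(`developer:${devId}`, JSON.stringify(profile));
}
```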
Client Initializer
34
Architected for Availability
THANK YOU
Me:
apollack@napster.com
github.com/lolpack
lolpack.me
Napster API Team:
@napsterAPI
Links:
White Paper: lolpack.me/rediswhitepaper.pdf
Try out Napster: order.napster.com/developer
API Docs: developer.napster.com
Editor's Notes

  • #2: Issues my team faced while scaling redis in production
  • #4: Address the elephant in the room
  • #6: How we use Redis: I work on the team that provides the public-facing API for Napster. We use Redis to store information about our developers and to authenticate our users.
  • #7: 70 million requests a day that fall through the cache. We store data about our developers, some user data, but mostly token sets as part of the OAuth flow.
  • #11: If you lose one, you lose the other. You are subject to the 28K port limit
  • #12: A quorum of 3 when you only have 4 sentinels can delay the time it takes to elect a new master.
  • #14: Once the new master is elected, it can immediately handle writes
  • #15: The default of 30 seconds allows for network hiccups and any other event that might trigger an unnecessary failover. We've tried tuning this down to decrease overall failover time, but if it's too short the cluster becomes too sensitive.
  • #19: When developing with small data sets it’s almost unnoticeable
  • #20: Authenticated calls are failing. Some health checks are failing. By the time you have been alerted and look at the problem, it's fixed itself.
  • #22: An unacceptable amount of downtime. A restart won't do anything for you; you are at the mercy of the time it takes to sync.
  • #23: - Can anyone else who has been on call relate?
  • #24: There is a linear correlation between data growth and the time it takes a slave to recover and become readable. BGSAVE doubles memory usage. It was a perfect storm of connections piling up, the BGSAVE memory issue, and tokens not expiring fast enough.
  • #25: The dust has settled and now it’s time to investigate the issue
  • #26: Set a maxmemory and a key-expiry policy. A key-expiry policy only works for ephemeral data, or if you are willing to lose persisted data.
  • #27: Make sure your app/client is not making a bad problem worse for redis by re-establishing connections as soon as they fail
  • #28: Systems will fail, so building redundancy into critical systems is essential
  • #29: We are at the mercy of our clients' implementations of OAuth. Monitoring usage allows us to proactively reach out to developers so they understand how the API should be used and we don't have to store extra data. We found a client was requesting a new auth token before each authenticated call. We have to allow all new token sets in and don't have a way of eagerly expiring old refresh tokens. Developer data has to stay; ephemeral data like refresh tokens can go.
  • #30: Switched NPM packages to ioredis and have never looked back. There was a bug in our old package where it wouldn't kill the old connection after a failed Redis lookup. We hit the 28K port limit during a Redis outage.
  • #31: Finally, some code! Create a global client referenced in the function to create a JS singleton
  • #32: Finally, some code! Create a global client referenced in the function to create a JS singleton. This ensures any place we require Redis throughout the app is using the same connection.
  • #33: Key configuration: `role: master`. These configs are helpful during problem or outage situations. enableOfflineQueue is dangerous for us: the only time we are offline is during an outage, so queueing up requests is not doing us any favors. retryStrategy is good for network outages or failovers.
  • #34: Redis is so fast and flexible that you may not consider volatility vs space issues. We were storing critical data alongside ephemeral data.
  • #36: The speed is not too shabby either: we can still auth a user in <50ms with the backend round trip included. We traded some performance, but not too much. No Redis downtime since the split, and easy upgrades (30-second failover).
  • #38: You can go to order.napster.com/developer and get a free 6 month trial of Napster. Build an app with our APIs and then tweet at us, we would love to see what you come up with!