“TOO BIG TO FAILOVER”
A cautionary tale of scaling Redis
Aaron Pollack - May 2017
Presentation Summary
2
● How Redis is used at Napster
● Problems with failover at scale
● Our solution for constant time failovers
Napster is still around?
● Rhapsody rebranded as Napster last
Spring
● Provides on-demand and radio streaming
for mobile and desktop apps
● Powers on-demand streaming for apps like
iHeartRadio
The cat is back!
5
API.NAPSTER.COM
+
NAPSTER API SNAPSHOT
● API Gateway Layer
● 1k developers using the API
● 70m requests/day
● 7k Redis ops/sec
We LOVE Redis (mostly)
● Fast! - Response times <10ms to Redis cluster
with network round trip included.
● Simple - Built-in data types translate easily into
JS. Replication comes free.
● Available - Redis is mission critical for us. When
it’s down, we’re down.
Architected for Speed
8
So What’s The Problem?
9
So What’s The Problem?
10
● Redis server and sentinel share the same host
So What’s The Problem?
11
● Redis server and sentinel share the same host
● Four sentinels
a. An even number means that there is a chance for ties if
quorum is 2
So What’s The Problem?
12
● Redis server and sentinel share the same host
● Four sentinels
a. An even number means that there is a chance for ties if
quorum is 2
● Sending all read traffic to slaves means that you have downtime
during failover
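The sentinel setup above can be sketched as a config fragment. Running an odd number of sentinels (3 or 5) on hosts separate from the Redis servers avoids both tie votes and losing a sentinel together with its server (the master name, address, and values below are illustrative, not our production config):

```
# sentinel.conf — illustrative values
sentinel monitor mymaster 10.0.0.1 6379 2       # quorum of 2; run 3 or 5 sentinels
sentinel down-after-milliseconds mymaster 30000 # how long before a master is considered down
sentinel failover-timeout mymaster 180000
```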
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated
3. A new slave is elected master
4. New master does full BGSAVE
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover
13
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover (1GB in Memory)
14
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover (1GB in Memory)
15
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory
7. Slave serves traffic
Steps in Failover (1GB in Memory)
16
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory (8 seconds)
7. Slave serves traffic
Steps in Failover (1GB in Memory)
17
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory (8 seconds)
7. Slave serves traffic
Steps in Failover (1GB in Memory)
18
Total Time: ~1.5 minutes
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (40 seconds)
5. Master syncs data to existing slaves (122 seconds)
6. Data is loaded into memory (43 seconds)
7. Slave serves traffic
Steps in Failover (5GB in Memory)
19
Total Time: ~4 minutes
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (181 seconds)
5. Master syncs data to existing slaves (305 seconds)
6. Data is loaded into memory (238 seconds)
7. Slave serves traffic
Steps in Failover (20GB in Memory)
20
Total Time: ~12.5 minutes
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (243 seconds)
5. Master syncs data to existing slaves (425 seconds)
6. Data is loaded into memory (354 seconds)
7. Slave serves traffic
Steps in Failover (40GB in Memory)
21
Total Time: ~18 minutes
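The totals above grow roughly linearly with dataset size. As a back-of-the-envelope model, with per-GB costs eyeballed from the measurements on these slides (an illustration of the trend, not a guarantee for any particular deployment):

```javascript
// Rough failover-time model fitted by eye to the numbers above.
const SENTINEL_DETECT_S = 30; // down-after-milliseconds default
const PER_GB_S = { bgsave: 7, sync: 11, load: 8 }; // seconds per GB, approximate

function estimateFailoverSeconds(gb) {
  const perGb = PER_GB_S.bgsave + PER_GB_S.sync + PER_GB_S.load;
  return SENTINEL_DETECT_S + gb * perGb;
}

console.log(estimateFailoverSeconds(40)); // 1070 s — about 18 minutes
```

The fixed 30-second detection window dominates for small datasets; past a few GB, the per-GB save/sync/load costs take over.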
RedisConf17 - Too Big to Failover - A cautionary tale of scaling Redis
23
Slaves Become Unreachable During Failover
1. What is causing the failover?
2. Why is the data growing so quickly?
Investigation
24
1. Out of memory
1. What’s causing the failover?
25
1. Out of memory
2. Saturated client connections
1. What’s causing the failover?
26
1. Out of memory
2. Saturated client connections
3. Gremlins
1. What’s causing the failover?
27
1. Can you control the growth
of data?
2. If you can’t control it, at least
monitor it!
3. Think about data in terms of
volatile vs non-volatile
2. Why is the data growing so quickly?
28
1. Connection Pooling!
a. https://github.com/luin/ioredis
2. Fast fail if connection is not ready
3. Backoff strategy for retry
3. How can we be better clients of Redis?
29
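The backoff idea in point 3 can be sketched as a `retryStrategy` function. ioredis calls it with the attempt count and expects a delay in milliseconds, or a non-number to stop retrying; the exact curve and cutoff below are our choice, not a library default:

```javascript
// Capped exponential backoff between reconnect attempts.
// Returning null tells ioredis to stop retrying, so callers fail fast
// instead of queueing commands against a dead server.
function retryStrategy(times) {
  if (times > 10) return null;             // give up after 10 attempts
  return Math.min(50 * 2 ** times, 2000);  // 100 ms, 200 ms, ... capped at 2 s
}
```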
ioredis
30
https://github.com/luin/ioredis
https://www.npmjs.com/package/ioredis
Client Singleton
31
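The code on this slide is an image, so here is a sketch of the idea: a module-scoped singleton, so every part of the app shares one connection. The factory is injected to keep the sketch runnable without a server; in our app it would be `() => new Redis(options)` from ioredis:

```javascript
// One shared client per process: the first call creates it, every later
// call returns the same instance (Node's require cache makes this global).
function makeGetClient(factory) {
  let client = null;
  return function getClient() {
    if (client === null) client = factory();
    return client;
  };
}

// Illustration with a stub factory standing in for `new Redis(...)`:
const getClient = makeGetClient(() => ({ createdAt: Date.now() }));
console.log(getClient() === getClient()); // true — a single shared instance
```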
Tuning ioredis Config
32
1. keepAlive - 0 (the default) starts TCP keep-alive on the Redis connection
with no initial delay
2. connectTimeout - milliseconds to wait before timing out the initial
connection to the Redis server
3. enableReadyCheck - wait for the server to finish loading the database from
disk before sending commands
4. retryStrategy - wait an increasing amount of time between connection
attempts
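Put together as an options object (the values reflect the spirit of what worked for us, not universal defaults — tune them for your own network):

```javascript
// ioredis connection options exercising the four knobs above.
const redisOptions = {
  keepAlive: 0,            // start TCP keep-alive probes with no initial delay
  connectTimeout: 5000,    // abort the initial connect after 5 s
  enableReadyCheck: true,  // wait until the dataset is loaded before sending commands
  retryStrategy: (times) => Math.min(times * 50, 2000), // linear backoff, 2 s cap
};

// new Redis(redisOptions) would consume these; shown standalone here.
console.log(redisOptions.retryStrategy(100)); // 2000
```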
1. Volatile vs non-volatile
a. Are you setting a TTL on keys?
2. What data is accessed the most?
4. Build your Redis env around your data
33
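Point 1a in practice: give every ephemeral key a TTL at write time so volatile data expires on its own, and persist durable data without one. A sketch (`client` is assumed to be an ioredis instance; the key names and TTL are illustrative):

```javascript
// Volatile data (e.g. OAuth token sets): always written with an expiry.
function storeToken(client, tokenId, payload) {
  return client.set(`token:${tokenId}`, JSON.stringify(payload), 'EX', 3600);
}

// Non-volatile data (e.g. developer records): persisted without a TTL,
// removed only deliberately.
function storeDeveloper(client, devId, profile) {
  return client.set(`developer:${devId}`, JSON.stringify(profile));
}
```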
Client Initializer
34
Architected for Availability
THANK YOU
Me:
apollack@napster.com
github.com/lolpack
lolpack.me
Napster API Team:
@napsterAPI
Links:
White Paper: lolpack.me/rediswhitepaper.pdf
Try out Napster: order.napster.com/developer
API Docs: developer.napster.com
Editor's Notes

  • #2: Issues my team faced while scaling redis in production
  • #4: Address the elephant in the room
  • #6: How we use Redis: I work on the team that provides the public-facing API for Napster. We use Redis to store information about our developers and to authenticate our users.
  • #7: 70 million requests a day that fall through the cache. We store data about our developers, some user data, but mostly token sets as part of the OAuth flow.
  • #11: If you lose one, you lose the other. You are subject to the 28K port limit
  • #12: A quorum of 3 when you only have 4 sentinels can delay the time it takes to elect a new master.
  • #14: Once the new master is elected, it can immediately handle writes
  • #15: The default of 30 seconds allows for network hiccups and any other event that might trigger an unnecessary failover. We've tried tuning this down to decrease overall failover time, but if it's too short the cluster becomes too sensitive.
  • #19: When developing with small data sets it’s almost unnoticeable
  • #20: Authenticated calls are failing. Some health checks are failing. By the time you have been alerted and look at the problem, it's fixed itself.
  • #22: An unacceptable amount of downtime. A restart won't do anything for you; you are at the mercy of the time it takes to sync.
  • #23: - Can anyone else who has been on call relate?
  • #24: There is a linear correlation between data growth and the time it takes a slave to recover and become readable. BGSAVE doubles memory usage. It was a perfect storm of connections piling up, the BGSAVE memory issue, and tokens not expiring fast enough.
  • #25: The dust has settled and now it’s time to investigate the issue
  • #26: Set a maxmemory and a key-expiry policy. A key-expiry policy only works for ephemeral data, or if you are willing to lose persisted data.
  • #27: Make sure your app/client is not making a bad problem worse for redis by re-establishing connections as soon as they fail
  • #28: Systems will fail, so building redundancy into critical systems is essential
  • #29: We are at the mercy of our clients' implementations of OAuth. Monitoring usage allows us to proactively reach out to developers so they understand how the API should be used and we don't have to store extra data. We found a client was requesting a new auth token before each authenticated call. We have to allow all new token sets in and don't have a way of eagerly expiring old refresh tokens. Developer data has to stay; ephemeral data like refresh tokens can go.
  • #30: Switched NPM packages to ioredis and have never looked back. There was a bug in our old package where it wouldn't kill the old connection after a failed Redis lookup. We hit the 28K port limit during a Redis outage.
  • #31: Finally, some code! Create a global client referenced in the function to create a JS singleton
  • #32: Finally, some code! Create a global client referenced in the function to create a JS singleton. This ensures any place we require Redis throughout the app is using the same connection.
  • #33: Key configuration: `role: master`. These configs are helpful during problem or outage situations. enableOfflineQueue is dangerous for us: the only time we are offline is during an outage, so queueing up requests is not doing us any favors. retryStrategy is good for network outages or failovers.
  • #34: Redis is so fast and flexible that you may not consider volatility vs space issues. We were storing critical data alongside ephemeral data.
  • #36: The speed is not too shabby either: we can still auth a user in <50ms with the backend round trip included. We traded some performance, but not too much. No Redis downtime since the split, and easy upgrades (30-second failover).
  • #38: You can go to order.napster.com/developer and get a free 6 month trial of Napster. Build an app with our APIs and then tweet at us, we would love to see what you come up with!