Preventing cache stampede with Redis & XFetch
Jim Nelson <jnelson@archive.org>
Internet Archive
RedisConf 2017
Internet Archive
Universal Access to All Knowledge
Founded 1996, based in San Francisco
Archive of digital and physical media
Includes Web, books, music, film, software & more
Digital holdings: over 30 petabytes & counting
Key collections & services:
Wayback Machine
Grateful Dead live concert collection
Internet Archive ♡ Redis
Caching & other services backed by 10-node sharded Redis cluster
Sharding performed client-side via consistent hashing (PHP, Predis)
Each node supported by two replicated mirrors (fail-over)
Specialized Redis instances also used throughout IA’s services, including
Wayback, search, and more
Caching: Quick terminology
I assume we all know what caching is. This is the terminology I’ll use today:
Recompute: Expensive operation whose result is cached
(database query, file system read, HTTP request to remote service)
Expiration: When a cache value is considered stale or out-of-date
(time-to-live)
Evict: Removing a value from the cache
(to forcibly invalidate a value prior to expiry)
Cache stampede
“A cache stampede is a type of cascading failure that can
occur when massively parallel computing systems with
caching mechanisms come under very high load. This
behaviour is sometimes also called dog-piling.”
–Wikipedia
https://en.wikipedia.org/wiki/Cache_stampede
Cache stampede: A scenario
Multiple servers, each with multiple workers serving requests, accessing a
common cached value
When the cached value expires or is evicted, all workers experience a
simultaneous cache miss
Workers recompute the missing value, causing overload of primary data
sources (e.g. database) and/or hung requests
Congestion collapse
Hung workers due to network congestion or expensive recomputes—that’s bad
Discarded user requests—that’s bad
Overloaded primary data stores (“Sources of Truth”)—that’s bad
Harmonics (peaks & valleys): brief periods of intense activity (mini-outages)
followed by lulls—that’s bad
Imagine a cached value with TTL of 1hr enjoying 10,000 hits/sec—that’s good.
Now imagine @ 1hr+1sec 10,000 cache misses —that’s bad.
Typical cache code
function fetch(name)
    var data = redis.get(name)
    if (!data)
        data = recompute(name)
        redis.set(name, expires, data)
    return data
This “looks” fine, but consider tens of thousands of simultaneous workers calling this code at once:
no mutual exclusion, no upper-bound to simultaneous recomputes or writes … that’s a cache stampede
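To make the failure mode concrete, here is a minimal Python sketch of the same read-through pattern under concurrent load. The in-memory `cache` dict, the `recompute` stub, the `stats` counter, and `run_workers` are all assumptions standing in for Redis, the expensive data source, and real request workers.

```python
import threading
import time

cache = {}                      # stand-in for Redis (assumption: a plain dict)
stats = {"recomputes": 0}       # how many times the expensive path ran
stats_lock = threading.Lock()


def recompute(name):
    """Pretend to be an expensive query and count each invocation."""
    with stats_lock:
        stats["recomputes"] += 1
    time.sleep(0.05)            # simulated cost of the primary data source
    return f"value-for-{name}"


def fetch(name):
    """Naive read-through cache: every worker that misses recomputes."""
    data = cache.get(name)
    if data is None:
        data = recompute(name)
        cache[name] = data
    return data


def run_workers(n, name="front-page"):
    """Release n workers at once against a cold cache; return the recompute count."""
    barrier = threading.Barrier(n)

    def worker():
        barrier.wait()          # all workers start their fetch together
        fetch(name)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["recomputes"]
```

Calling `run_workers(20)` on a cold cache should report far more than one recompute, because nothing stops every worker that saw the miss from hitting the data source.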
Typical stampede solutions
(a) Locking
One worker acquires lock, recomputes, and writes value to cache
Other workers wait for lock to be released, then retry cache read
Primary data source is not overloaded by requests
Redis is often used as a cluster-wide distributed lock:
https://redis.io/topics/distlock
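The locking approach can be sketched in-process as follows. This is an illustration only: `threading.Lock` stands in for a cluster-wide Redis lock, and `cache`, `stats`, `recompute`, and `fetch_with_lock` are hypothetical names. A real deployment would also need a lock timeout so an abandoned lock cannot block workers forever.

```python
import threading
import time

cache = {}                          # stand-in for Redis
stats = {"recomputes": 0}
recompute_lock = threading.Lock()   # stand-in for a distributed lock


def recompute(name):
    """Expensive operation; only ever called while holding recompute_lock."""
    stats["recomputes"] += 1
    time.sleep(0.05)
    return f"value-for-{name}"


def fetch_with_lock(name):
    """One worker recomputes; the others wait, then retry the cache read."""
    data = cache.get(name)
    if data is not None:
        return data
    with recompute_lock:            # other workers block here
        data = cache.get(name)      # retry: the lock holder may have filled the cache
        if data is None:
            data = recompute(name)
            cache[name] = data
    return data
```

The second `cache.get` inside the lock is what keeps the waiting workers from recomputing again once the first one has finished.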
Problems with locking
Introduces extra reads and writes into code path
Starvation: expiration / eviction can lead to blocked workers waiting for a
single worker to finish recompute
Distributed locks may be abandoned
Typical stampede solutions
(b) External recompute
Use a separate process / independent worker to recompute value
Workers never recompute
(Alternately, workers recompute as fall-back when external process fails)
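A rough sketch of external recompute, assuming a background thread plays the role of the separate process. The names `refresh_once`, `start_refresher`, and the dict-backed `cache` are hypothetical; in production the refresher would more likely be a daemon or cron job writing to Redis.

```python
import threading

cache = {}                          # stand-in for Redis


def recompute(name):
    """Placeholder for the expensive operation."""
    return f"value-for-{name}"


def refresh_once(names):
    """One pass of the external job: recompute every known key and overwrite it."""
    for name in names:
        cache[name] = recompute(name)


def start_refresher(names, interval_seconds, stop_event):
    """Run refresh_once in a daemon thread until stop_event is set."""
    def loop():
        while not stop_event.is_set():
            refresh_once(names)
            stop_event.wait(interval_seconds)   # sleep, but wake early on stop

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


def fetch(name):
    """Workers normally only read; a miss triggers the fall-back recompute."""
    data = cache.get(name)
    if data is None:
        data = recompute(name)      # fall-back when the external job has not run
        cache[name] = data
    return data
```

Note that the refresher has to be told which keys exist (`names`), which is exactly the difficulty when cache keys depend on arbitrary user input.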
Problems with external recompute
One more “moving part”—a daemon, a cron job, work stealing
Requires fall-back scheme if external recompute fails to run
External recomputation is often not easily deterministic:
caching based on a wide variety of user input
periodic external recomputation of 1,000,000 user records
External recomputation may be inefficient if cached values are never read by workers
XFetch
(Probabilistic early recomputation)
Probabilistic early recomputation (PER)
Recompute cache values before they expire
Before expiration, one worker “volunteers” to recompute the value
Without evicting old value, volunteer performs expensive recompute—
other workers continue reading cache
Before expiration, volunteer writes new cache value and extends its
time-to-live
Under ideal conditions, there are no cache misses
XFetch
Full paper title: “Optimal Probabilistic Cache Stampede Prevention”
Authors:
Andrea Vattani (Goodreads)
Flavio Chierichetti (Sapienza University)
Keegan Lowenstein (Bugsnag)
Archived at IA:
https://archive.org/details/xfetch
The algorithm
XFetch (“exponential fetch”) is elegant:
delta * beta * log_e(rand())
where
delta – Time to recompute value
beta – control (default: 1.0, > 1.0 favors earlier recomputation, < 1.0 favors later)
rand – Random number in ( 0.0 … 1.0 ], excluding 0 because log(0) is undefined
Remember: log(x) is negative for 0 < x < 1 (and log(1) is 0), so XFetch never produces a positive value
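As a worked example of the formula, the sketch below computes the XFetch term and the chance that a single read volunteers at a given distance from expiry. The function names are hypothetical. The closed form follows from rearranging `now - delta * beta * ln(r) >= expiry` into `r <= exp(-gap / (delta * beta))` for a uniform `r`.

```python
import math
import random


def early_offset(delta, beta=1.0, rng=random.random):
    """Return delta * beta * ln(r) for uniform r; the result is always <= 0."""
    r = 1.0 - rng()     # random.random() is in [0, 1); flip to (0, 1] so log never sees 0
    return delta * beta * math.log(r)


def recompute_probability(seconds_to_expiry, delta, beta=1.0):
    """Chance that one read triggers a recompute when expiry is this far away."""
    if seconds_to_expiry <= 0:
        return 1.0      # at or past expiry the test always passes
    return math.exp(-seconds_to_expiry / (delta * beta))
```

With `delta = 2` seconds and `beta = 1.0`, a read 10 seconds before expiry volunteers with probability of about 0.007, at 2 seconds about 0.37, and at 0.5 seconds about 0.78, so the pressure to recompute rises sharply as the deadline nears.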
Updated code
function fetch(name)
    var data, delta, ttl = redis.get(name, delta, ttl)
    if (!data or xfetch(delta, time() + ttl))
        /* recompute returns the new value and how long it took */
        data, recompute_time = recompute(name)
        redis.set(name, expires, data)
        redis.set(delta, expires, recompute_time)
    return data
function xfetch(delta, expiry)
    /* delta * BETA * log(rand) is zero or negative, so subtracting it adds to time() */
    return time() - (delta * BETA * log(rand(0,1))) >= expiry
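The slide's pseudocode could be fleshed out along these lines. This is a sketch under stated assumptions, not the speaker's sample code: `XFetchCache` is a hypothetical in-memory stand-in for Redis that stores `(value, delta, expiry)` per key, and the injectable `clock` and `rng` exist only to make the behavior easy to test.

```python
import math
import random
import time

BETA = 1.0      # default from the slides; > 1.0 recomputes earlier


class XFetchCache:
    """In-memory stand-in for Redis keeping (value, delta, expiry) per key."""

    def __init__(self, clock=time.time, rng=random.random):
        self._store = {}
        self._clock = clock
        self._rng = rng

    def _should_recompute(self, delta, expiry):
        r = 1.0 - self._rng()           # uniform in (0, 1], so log(r) <= 0
        return self._clock() - delta * BETA * math.log(r) >= expiry

    def fetch(self, name, ttl, recompute):
        entry = self._store.get(name)
        if entry is not None and self._clock() >= entry[2]:
            entry = None                # expired: behave like a Redis miss
        if entry is None or self._should_recompute(entry[1], entry[2]):
            started = self._clock()
            value = recompute(name)
            delta = self._clock() - started     # duration of this recompute
            self._store[name] = (value, delta, self._clock() + ttl)
            return value
        return entry[0]
```

A caller would use it as `cache.fetch("front-page", ttl=3600, recompute=load_front_page)`, where `load_front_page` is whatever expensive function produces the value.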
Can more than one volunteer recompute?
Yes. You should know this before using XFetch.
It’s possible for more than one worker to “roll” the magic number and start a
recompute. The odds of this occurring increase as the expiration deadline
approaches.
If your data source absolutely cannot be accessed by multiple workers, use a
lock or another sentinel—XFetch will minimize lock contention
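One way to combine XFetch with a lock, sketched with hypothetical names (`lock_for`, `fetch_guarded`) and an in-process `threading.Lock` standing in for a distributed one: a volunteer takes the lock without blocking, and anyone who loses the race keeps serving the value that is still cached.

```python
import threading

_locks = {}
_locks_guard = threading.Lock()


def lock_for(name):
    """Return the per-key lock, creating it on first use."""
    with _locks_guard:
        return _locks.setdefault(name, threading.Lock())


def fetch_guarded(name, load, store, recompute, wants_early):
    """load/store access the cache; wants_early() is the XFetch roll."""
    value = load(name)
    lock = lock_for(name)
    if value is not None:
        if not wants_early():
            return value
        if not lock.acquire(blocking=False):
            return value            # another volunteer is recomputing; serve the old value
    else:
        lock.acquire()              # true miss: wait for the current holder
        fresh = load(name)
        if fresh is not None:       # the holder filled the cache while we waited
            lock.release()
            return fresh
    try:
        value = recompute(name)
        store(name, value)
        return value
    finally:
        lock.release()
```

Because early recomputation happens while the old value is still readable, the blocking branch is only reached on a genuine miss, which is why lock contention stays low.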
How to determine delta?
XFetch must be supplied with the time required to recompute.
The easiest approach is to store the duration of the last recompute and read it
with the cached value.
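A possible way to record and read back delta, assuming a redis-py style client created with `decode_responses=True` (so `hgetall` returns string keys and values). The helper names and the `data`/`delta` field names are hypothetical. `hset` followed by `expire` is not atomic here; wrapping both in a transaction or pipeline would be a reasonable refinement.

```python
import time


def recompute_and_store(client, name, ttl_seconds, recompute):
    """Time the recompute, then store the value and its duration under one key."""
    started = time.perf_counter()
    value = recompute(name)
    delta = time.perf_counter() - started           # seconds spent recomputing
    client.hset(name, mapping={"data": value, "delta": delta})
    client.expire(name, ttl_seconds)                # both fields expire together
    return value, delta


def load_with_delta(client, name):
    """Return (value, delta, seconds_until_expiry), or None on a cache miss."""
    fields = client.hgetall(name)
    if not fields:
        return None
    return fields["data"], float(fields["delta"]), client.ttl(name)
```

Keeping the value and delta in one hash means a single expiry covers both, so a reader never sees a value without the delta XFetch needs.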
What’s the deal with the beta value?
beta is the one knob you have to tweak XFetch.
beta > 1.0 favors earlier recomputation, < 1.0 favors later recomputation.
My suggestion: Start with the default (1.0), instrument your code, and change
only if necessary.
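To get a feel for the knob before instrumenting real traffic, a small simulation can help. For a uniform random `r`, `-ln(r)` averages 1, so the XFetch test looks ahead by `delta * beta` seconds on average. The function name, sample count, and seed below are arbitrary choices for illustration.

```python
import math
import random


def mean_lead_time(delta, beta, samples=100_000, seed=0):
    """Average of -delta * beta * ln(r): how far ahead of now the XFetch test looks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        r = 1.0 - rng.random()      # uniform in (0, 1]
        total += -delta * beta * math.log(r)
    return total / samples
```

With `delta = 2.0`, the result should come out close to 1 second for `beta = 0.5`, 2 seconds for `beta = 1.0`, and 4 seconds for `beta = 2.0`, which matches the rule that a larger beta favors earlier recomputation.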
XFetch & Redis
Let’s look at some sample code
Questions?
