Caching at Netflix: The
Evolution of EVCache
Scott Mansfield
Builderscon 2018
EVCache Builderscon
90 seconds
What do caches touch?
Video production*
Search*
Choosing a profile
My List
Personalization*
Loading home page*
CDN selection*
A/B tests
Video encoding selection
Viewing title details*
Playing a title*
Subtitle / language prefs
Localization strings
Thumbs up / down
Picking liked videos
Video history*
UI strings
Session management
Video bookmarks
Device history
Device management
Authentication
Scrolling home page*
Title image selection
Geographic location
Signing up*
* multiple caches involved
Home Page
Request
Key-Value store optimized for
AWS and tuned for Netflix
use cases
Ephemeral Volatile Cache
What is EVCache?
Distributed, sharded, replicated key-value store
Based on Memcached
Tunable in-region and global replication
Resilient to failure
Topology aware
Linearly scalable
Seamless deployments
Why Optimize for AWS
Instances disappear
Zones fail
Regions become unstable
Network is lossy
Customer requests bounce between regions
Failures happen and we test all the time
EVCache Use @ Netflix
3 regions
4 engineers
~100 clusters
~160 servers/cluster
~16,000 servers
~5,000,000 replications/sec
~100,000,000 ops/sec
~400,000,000,000 items
~6,000,000,000,000 ops/day
~3,000,000,000,000,000 bytes
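The figures above can be cross-checked with some quick arithmetic. This is only a sketch: the constants are the rounded numbers from the slide, and the derived averages are approximations that the talk itself does not state.

```python
# Rounded figures copied from the slide.
CLUSTERS = 100
SERVERS_PER_CLUSTER = 160
OPS_PER_DAY = 6_000_000_000_000
ITEMS = 400_000_000_000
TOTAL_BYTES = 3_000_000_000_000_000

SECONDS_PER_DAY = 24 * 60 * 60


def fleet_size() -> int:
    """Clusters times servers per cluster."""
    return CLUSTERS * SERVERS_PER_CLUSTER


def average_ops_per_sec() -> float:
    """Daily operations spread evenly over one day."""
    return OPS_PER_DAY / SECONDS_PER_DAY


def average_bytes_per_item() -> float:
    """Total stored bytes divided by item count."""
    return TOTAL_BYTES / ITEMS
```

The fleet size matches the ~16,000 servers on the slide. The daily total averages out to roughly 69 million ops/sec, which would be consistent with ~100,000,000 ops/sec if that number is a peak rate (an assumption; the slide does not say).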
What We're Covering
1) EVCache intro
2) Architecture
3) Use Cases
4) End to End
5) Supporting Features
6) Server Evolution
7) Optimization
Architecture
Architecture
Server
Memcached
EVCar
Application
Client Library
Client
Architecture
us-west-2a us-west-2b us-west-2c
Client Client Client
Copy 1 Copy 2 Copy 3
Reading (get)
us-west-2a us-west-2b us-west-2c
Client
Primary Secondary
Writing (set, delete, add, etc.)
us-west-2a us-west-2b us-west-2c
Client Client Client
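The read and write diagrams above can be modelled with a toy client. This is a minimal sketch, not the EVCache client API: the class name and methods are hypothetical, a dict stands in for each zone's Memcached servers, and the fallback order on reads is one reading of the "Primary / Secondary" labels.

```python
from typing import Dict, List, Optional


class ZoneReplicatedCache:
    """Toy model: one full copy of the data per availability zone."""

    def __init__(self, zones: List[str], local_zone: str) -> None:
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self.local_zone = local_zone
        # Each zone holds its own copy of the data.
        self.copies: Dict[str, Dict[str, bytes]] = {z: {} for z in zones}

    def set(self, key: str, value: bytes) -> None:
        # Writes (set, delete, add, etc.) go to every zone's copy.
        for copy in self.copies.values():
            copy[key] = value

    def delete(self, key: str) -> None:
        for copy in self.copies.values():
            copy.pop(key, None)

    def get(self, key: str) -> Optional[bytes]:
        # Reads try the local zone first, then fall back to the other zones.
        order = [self.local_zone] + [z for z in self.copies if z != self.local_zone]
        for zone in order:
            value = self.copies[zone].get(key)
            if value is not None:
                return value
        return None
```

Because every zone holds a full copy, losing the local copy of a key (or a whole zone) turns a read into a cross-zone lookup rather than a miss.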
Use Cases
Use Case 1: Lookaside Cache
Application
Client Library
Client HTTP / gRPC Client
S
C
Data Flow
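The lookaside pattern in this diagram is commonly written as "check the cache, call the service on a miss, then fill the cache". A minimal sketch, with a plain dict standing in for the cache client and a hypothetical `load_from_service` callable standing in for the HTTP / gRPC client:

```python
from typing import Callable, Dict, Optional


def lookaside_get(
    cache: Dict[str, str],
    key: str,
    load_from_service: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the cached value, falling back to the backing service on a miss."""
    value = cache.get(key)
    if value is not None:
        return value  # cache hit: the service is never called
    value = load_from_service(key)  # cache miss: ask the service
    if value is not None:
        cache[key] = value  # populate the cache for later readers
    return value
```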
Use Case 2: Transient Data Store
Application
Client Library
Client
Application
Client Library
Client
Application
Client Library
Client
Time
Use Case 3: Primary Store
Offline / Nearline
Precomputes for
Recommendations
Online Services
Offline Services
Online Application
Client Library
Client
Data Flow
Online Services
Offline Services
Use Case 4: Versioned Primary Store
Offline Compute
Online Application
Client Library
Client
Data Flow
Dynamic
Properties
Control System
(Valhalla)
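One way to picture the versioned primary store: offline compute writes a complete dataset under a new version, and a control step then flips a property so online readers switch to it. The sketch below is an illustration under assumptions. The `v{n}:` key prefix, the class, and the `active_version` field (standing in for a dynamic property) are hypothetical; the slide names only the components.

```python
from typing import Dict, Optional


class VersionedStore:
    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}
        # Stand-in for a dynamic property managed by the control system.
        self.active_version: Optional[int] = None

    @staticmethod
    def _key(version: int, key: str) -> str:
        return f"v{version}:{key}"

    def publish(self, version: int, data: Dict[str, str]) -> None:
        # Offline compute writes the whole dataset under the new version first.
        for key, value in data.items():
            self.cache[self._key(version, key)] = value
        # Only after the write completes are readers switched over.
        self.active_version = version

    def read(self, key: str) -> Optional[str]:
        if self.active_version is None:
            return None
        return self.cache.get(self._key(self.active_version, key))
```

Readers never see a half-written dataset in this model, because the version pointer moves only after all keys for that version are in place.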
Use Case 5: High Volume && High Availability
Data Flow
Application
Client Library
In-memory Remote HTTP/gRPC Client
Compute & Publish
on schedule
S
C
Optional
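The lookup order in this diagram (in-memory, then remote, then an optional service call) can be sketched as below. The function name and the dict-based tiers are hypothetical stand-ins, not the client library's API.

```python
from typing import Callable, Dict, Optional


def tiered_get(
    key: str,
    in_memory: Dict[str, str],
    remote: Dict[str, str],
    service: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Check the in-process cache, then the remote cache, then the optional service."""
    if key in in_memory:
        return in_memory[key]
    value = remote.get(key)
    if value is None and service is not None:
        value = service(key)  # optional last resort
    if value is not None:
        in_memory[key] = value  # keep hot keys in process memory
    return value
```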
Use Case 6: Personalization Fact Store
Playlist Video History
Personalization
Experimentation
PVR Preferences GPS
https://vimeo.com/274954151
End to End
Pipeline of Personalization
Compute A
Compute B Compute C
Compute D
Online Services
Offline Services
Compute E
Data Flow
Online 1 Online 2
Full Circle
Online Services
Unified Logging
Personalization
Experimentation
Supporting Features
Supporting Features
Global data replication
Polyglot clients
Cache warming
Netflix Global Architecture
eu-west-1
us-west-2 us-east-1
NA LATAM APAC
EMEA
Cross-Region Replication
Region A: APP, Kafka, Repl Relay, Repl Proxy
Region B: APP, Kafka, Repl Relay, Repl Proxy
1 mutate
2 send metadata
3 poll msg
4 get data for set
5 https send msg
6 mutate
7 read
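The seven numbered steps in the replication diagram can be walked through in a small simulation. This is a sketch under assumptions: a deque stands in for Kafka, a direct method call stands in for the HTTPS hop between Repl Relay and Repl Proxy, and all class and function names are hypothetical. Skipping a set whose value has already vanished locally is also an assumption.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class Region:
    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Dict[str, str] = {}
        # Stand-in for the Kafka topic that carries replication metadata.
        self.queue: Deque[Tuple[str, str]] = deque()

    def app_set(self, key: str, value: str) -> None:
        self.cache[key] = value          # 1: mutate the local cache
        self.queue.append(("set", key))  # 2: send metadata only, not the value

    def app_delete(self, key: str) -> None:
        self.cache.pop(key, None)
        self.queue.append(("delete", key))

    def proxy_apply(self, op: str, key: str, value: Optional[str]) -> None:
        # 6: the Repl Proxy mutates the destination cache; nothing is re-queued.
        if op == "set" and value is not None:
            self.cache[key] = value
        elif op == "delete":
            self.cache.pop(key, None)

    def app_get(self, key: str) -> Optional[str]:
        return self.cache.get(key)       # 7: read in the destination region


def relay_once(source: Region, dest: Region) -> bool:
    """Process one queued message; return False if the queue was empty."""
    if not source.queue:
        return False
    op, key = source.queue.popleft()     # 3: poll msg
    value = None
    if op == "set":
        value = source.cache.get(key)    # 4: get data for set
        if value is None:
            return True                  # value already gone locally: nothing to send
    dest.proxy_apply(op, key, value)     # 5: send msg (HTTPS in the diagram)
    return True
```

Queuing only metadata keeps messages small; the relay fetches the current value at send time, so a key written several times replicates its latest value.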
Polyglot Clients
APP
Java Client
APP
Prana
Local Proxy
Memcached
APP
Memcached Proxy
HTTP
APP
HTTP Proxy
Cache Warming (Deployment)
Cache Warmer
Application
Client Library
Client
Control
S3
Data Flow
Metadata Flow
Control Flow
Server Evolution
Original Server
Server V1
Original Server
● Stock Memcached and EVCar (sidecar)
● All data stored in RAM in Memcached
● Expensive with global expansion / N+1 architecture
Memcached
EVCar
external
Moneta
Server V2
Moneta
The Goddess of Memory
● Global data in cache means many copies
● Access patterns are heavily region-oriented
● Keep hot data in RAM, cold data on SSD
Moneta Server
● Adds Rend proxy and Mnemonic disk storage
● Still looks like Memcached from the outside
Rend
EVCar
Memcached (RAM)
Mnemonic (SSD)
external internal
Rend
● High-performance Memcached proxy
● Manages connections, request orchestration, and
communication
● Low-overhead metrics library
● Multiple orchestrators
● Parallel locking for data integrity
● Efficient connection pool
Server Loop
Request Orchestration
Backend Handlers
Metrics
Connection Management
Protocol
Mnemonic
● Manages data storage on SSD
● Maps Memcached ops to RocksDB ops
● Sharded RocksDB in FIFO mode
Rend Server Core Lib (Go)
Mnemonic Op Handler (Go)
Mnemonic Core (C++)
RocksDB (C++)
Moneta in Production
● Held all of our personalization data
● One port for standard users (read heavy)
● Second port for async replication and batch users
● Maintains working set in RAM
● Optimized for precomputes
○ Replaces in-use data in L1
external internal
EVCar
Memcached (RAM)
Mnemonic (SSD)
Std
Batch
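The RAM/SSD split and the two ports can be pictured as a two-level store. The sketch below is a simplification under assumptions, not Rend's actual orchestration logic: an LRU-bounded OrderedDict stands in for Memcached (L1), a dict stands in for Mnemonic (L2), and the method names are hypothetical. The batch path follows the "replaces in-use data in L1" bullet by updating L1 only when the key is already there.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredStore:
    def __init__(self, l1_capacity: int) -> None:
        self.l1: "OrderedDict[str, bytes]" = OrderedDict()  # RAM: small, LRU-evicted
        self.l2: Dict[str, bytes] = {}                      # SSD: holds everything
        self.l1_capacity = l1_capacity

    def _l1_put(self, key: str, value: bytes) -> None:
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict the least recently used key

    def get(self, key: str) -> Optional[bytes]:
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key]
        value = self.l2.get(key)
        if value is not None:
            self._l1_put(key, value)  # promote hot data into RAM
        return value

    def set_standard(self, key: str, value: bytes) -> None:
        # Standard port: write to both levels.
        self.l2[key] = value
        self._l1_put(key, value)

    def set_batch(self, key: str, value: bytes) -> None:
        # Batch port: always write to SSD, touch RAM only if the key is in use.
        self.l2[key] = value
        if key in self.l1:
            self.l1[key] = value
```

With this split, a large precompute load through the batch path does not push the working set out of RAM.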
"Optimization" #1
Problem: Latency numbers higher than expected
Evidence: 99.9th percentile of 9.68 ms
Bug: Measuring time between requests, not of each request
Solution: Properly instrument to get real timing
Optimization #2
Problem: High CPU load during peak traffic
Evidence: Profiling shows CPU used on read/write syscalls
Fix: Batching and multiplexing requests
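Looking back at the instrumentation bug from "Optimization" #1, the difference between the two measurements is easy to show with explicit timestamps. This is an illustrative sketch with made-up sample spans, not the actual Rend metrics code; both function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) of one request, in milliseconds


def between_request_times(spans: List[Span]) -> List[float]:
    """The bug: time from one request's start to the next request's start.

    This includes however long the connection sat idle between requests.
    """
    return [nxt[0] - cur[0] for cur, nxt in zip(spans, spans[1:])]


def per_request_times(spans: List[Span]) -> List[float]:
    """The fix: time each request from its own start to its own end."""
    return [end - start for start, end in spans]
```

On a mostly idle connection the first measure is dominated by the gaps between requests, which inflates the high percentiles even when every individual request is fast.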
Rend, Memcached
Syscalls per request: 1R 2W 3R 4W 5R 6W
Rend Batching Backend
● Batch requests to Memcached
● Multiplex on connection pool
● Synchronous interface, asynchronous implementation
Overview
Relay: client connections 1 2 3 ... N multiplexed onto Memcached connections 1 ... M
Client Connection
Sync: Handler (client conn) submits request to the Relay, waits for response
Async: Pool (Relay)
Relay
Client connections 1 2 3 ... N pick a Memcached connection (1 ... M, M+1) via rand()
Monitor: read conn stats, wait, "Add?" No: keep monitoring. Yes: add new conn.
Batching
Pooled connection (relay): client connections feed a request channel; one external connection to Memcached
Batcher: batch until full or timeout, write batch data, record map (req→res chan)
Reader: read responses, match through the map, deliver on response channels
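The batching idea (a synchronous interface over an asynchronous implementation, with requests batched "until full or timeout") can be sketched with threads and queues. This is a toy model, not Rend's Go code: the class, the `send_batch` callable, and the default sizes are hypothetical, and responses are matched to callers by order rather than through the request-to-channel map shown in the diagram.

```python
import queue
import threading
import time
from typing import Callable, List, Tuple


class BatchingConnection:
    """One pooled backend connection shared by many client handlers."""

    def __init__(
        self,
        send_batch: Callable[[List[str]], List[str]],
        max_batch: int = 8,
        timeout_s: float = 0.005,
    ) -> None:
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._timeout_s = timeout_s
        self._requests: "queue.Queue[Tuple[str, queue.Queue]]" = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def get(self, key: str) -> str:
        # Synchronous interface: submit, then block on a private response channel.
        response: "queue.Queue[str]" = queue.Queue(maxsize=1)
        self._requests.put((key, response))
        return response.get()

    def _run(self) -> None:
        while True:
            batch = [self._requests.get()]  # block until the first request arrives
            deadline = time.monotonic() + self._timeout_s
            while len(batch) < self._max_batch:  # batch until full...
                remaining = deadline - time.monotonic()
                if remaining <= 0:               # ...or timeout
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            keys = [key for key, _ in batch]
            values = self._send_batch(keys)  # one backend round trip for the whole batch
            for (_, response), value in zip(batch, values):
                response.put(value)          # wake the waiting handler
```

Each call to `send_batch` stands in for one write and one read on the Memcached connection, so N concurrent requests cost far fewer than 2N syscalls on that side.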
Optimization #2
Problem: High CPU load during peak traffic
Evidence: Profiling shows CPU used on read/write syscalls
Fix: Batching and multiplexing requests
Result: >15% reduction in CPU usage
Moneta Performance in Prod
Latencies: peak (trough)
Get Percentiles:
● 50th: 79.3 μs (76.6 μs)
● 75th: 92.1 μs (84.8 μs)
● 90th: 112 μs (101 μs)
● 95th: 140 μs (121 μs)
● 99th: 424 μs (378 μs)
● 99.5th: 511 μs (431 μs)
● 99.9th: 1.28 ms (651 μs)
Set Percentiles:
● 50th: 68.7 μs (59.5 μs)
● 75th: 82.3 μs (71.8 μs)
● 90th: 98.2 μs (84.3 μs)
● 95th: 107 μs (92.7 μs)
● 99th: 143 μs (118 μs)
● 99.5th: 169 μs (131 μs)
● 99.9th: 1.97 ms (264 μs)
Back to the Future
Server V3
Memcached Extstore
Fantastic improvement to store data on disk, index in RAM
Addition by Dormando, maintainer of Memcached
Retires ~18 months × 2 engineers of work
No maintenance of proprietary code
Everyone gets improvements (OSS)
https://memcached.org/blog/nvm-caching/
https://github.com/memcached/memcached/wiki/Extstore
Advantages over Moneta
Use 80 - 90% of disk instead of only 50%
Single process
Spectre & Meltdown made syscalls significantly slower
Easier management
Lower latency
Cache warming possible for disk-backed clusters
Enables disk for high write volume and multi-petabyte clusters
Deployment
Nov 2017 Test extstore beta / bug fixes
Dec 2017 First production deployment
Feb 2018 Extstore fully deployed
Apr 2018 New use cases start using extstore
Open Source
https://github.com/netflix/EVCache
https://github.com/netflix/rend
https://github.com/memcached/memcached
Thank You
smansfield@netflix.com
@sgmansfield
techblog.netflix.com
Backup (and Restore)
Cache Warmer
(Spark)
Application
Client Library
Client
Control
S3
Data Flow
Control Flow
Lost Instance Recovery
Cache Warmer
(Spark)
Application
Client Library
Client
S3
Partial Data Flow
Metadata Flow
Control Flow
Control
Zone A Zone B
Data Flow
