Redis at LINE
25 Billion Messages Per Day
Jongyeol Choi
LINE+ Corporation
Speaker
• Jongyeol Choi
• Software engineer
• Lives in South Korea
• Works on Redis team at LINE
• Previously worked at Samsung Electronics
• Contributed to Netty (netty-codec-redis), Lettuce, etc.
Agenda
• LINE
• Storage systems for LINE Messaging System
• In-house Redis Cluster
• Scalable monitoring system
• Experiences with the official Redis Cluster
• Asynchronous Redis client
• Current challenges and future work
LINE
• Messaging service
• 168 million active users in Japan, Taiwan, Thailand, and Indonesia
• 25 billion messages per day
• 420,000 messages sent per second at peak
• Many family services
• News, Music, LIVE (video streaming), Games, and more
LINE Messaging System
• Messaging server
• Most messaging features
• Java 8, Spring, Thrift, Tomcat, and Armeria
• Asynchronous task processor systems
• A newer system backed by Kafka clusters
• An older system backed by a Redis queue process per messaging server machine
• Other related components
[Diagram: clients → API gateway → messaging server and asynchronous task processor, backed by Redis, HBase, and Kafka clusters]
Storage Systems for LINE Messaging
• Redis
• Cache or Primary Storage
• HBase
• Backup Storage or Primary Storage
• Kafka
• For asynchronous processing
• Previous presentations about HBase and Kafka
• "HBase at LINE 2017" at LINE Developer Day 2017
• "Kafka at LINE" at Kafka Summit San Francisco 2017
[Diagram: the same architecture as above, highlighting the storage systems — Redis, HBase, and Kafka — behind the messaging server and asynchronous task processor]
Redis Usage for LINE Messaging
• Redis versions: 2.6, 2.8, 3.0, 3.2
• 60+ Redis clusters (In-house Redis clusters + Official Redis clusters)
• 1,000+ physical machines (8–12 Redis nodes per machine)
• Each machine: 10–20 cores (20–40 threads) / 192–256 GB memory
• 10,000+ Redis nodes (max operations per second per node < 100,000)
• 370+ billion Redis keys and 120+ TB of data in our Redis clusters
• Some clusters have 1,000–3,000 nodes each, including slave nodes
In-house Redis Cluster
• Client-side sharding without proxy
• Sharding rules
• Fixed-size ring or consistent hashing (a minimal sharding sketch follows the diagram)
• In-house facility implementations
• Cluster Manager Server + UI (Redhand)
• LINE Redis Client (Java, built on Jedis and Lettuce)
• Redis Cluster Monitor (Scala, Akka)
[Diagram: the Cluster Manager Server health-checks each shard (master/slave) and updates ZooKeeper; the LINE Redis Client inside each Java application syncs cluster state from ZooKeeper; the Redis Cluster Monitor gathers statistics]
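Below is a minimal sketch of how fixed-size-ring client sharding can work, assuming Jedis connections; the ring size, hashing, and class names are illustrative assumptions, not LINE's actual client.

// Illustrative fixed-size-ring sharding client (not LINE's actual code).
import redis.clients.jedis.Jedis;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

class FixedRingClient {
    private static final int RING_SIZE = 8192;  // fixed, so key→slot never changes
    private final List<Jedis> shardMasters;     // one connection per shard master

    FixedRingClient(List<Jedis> shardMasters) {
        this.shardMasters = shardMasters;
    }

    // Hash the key to a fixed slot, then map the slot to a shard.
    private Jedis shardFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        int slot = (int) (crc.getValue() % RING_SIZE);
        return shardMasters.get(slot % shardMasters.size());
    }

    String get(String key)               { return shardFor(key).get(key); }
    String set(String key, String value) { return shardFor(key).set(key, value); }
}

Because the ring size never changes, the key→slot mapping is stable; resizing therefore means migrating slots to a new cluster, as described in the resizing slide below.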
Pros/Cons of Proxy-less (Client-side Sharding)
• Pros
• Short latency
• Average response time is 100–200 μs (messaging needs many storage I/Os in one API call)
• Cost efficiency
• Don’t need thousands of proxy servers
• Cons
• Client implementation is language-dependent
• Fat client: hard to maintain and release to all dependent server systems
[Diagram: applications reaching Redis through proxy servers vs. directly via client-side sharding]
Failover for In-house Redis Cluster
• Cluster types and data types
• Cache (master only) or storage (master/slave)
• Immutable or Mutable
• The Cluster Manager Server sends PING to all Redis nodes every 2 seconds
• When a master doesn't respond
• Cache: failure state → use the origin storage
• Storage: a slave becomes the new master
• Applications "eventually" get the updated cluster information from ZooKeeper
[Diagram: the Cluster Manager Server PINGs shard-1 and shard-2, updates ZooKeeper on failure, and applications sync the change]
Failover for Mutable Data in the In-house Redis Cluster
• Failover with mutable data
• Recovery Mode (sketched after the diagram)
• A client-side solution
• Delays all Redis commands to the target shard for a few seconds
• Redis server nodes don't know about each other (so redirection can't be used)
[Diagram: applications holding commands to shard-1 in Recovery Mode while shard-2 serves normally]
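A minimal sketch of the Recovery Mode idea on the client side, assuming a hypothetical per-shard delay window; the constants and names are illustrative.

// Illustrative client-side Recovery Mode: commands to a failing-over
// shard are parked for a short window instead of failing immediately.
import java.util.concurrent.ConcurrentHashMap;

class RecoveryMode {
    private static final long WINDOW_MILLIS = 3_000;  // "a few seconds" (assumption)
    private final ConcurrentHashMap<Integer, Long> recoveringUntil = new ConcurrentHashMap<>();

    // Called when the cluster manager reports a failover on this shard.
    void enter(int shardId) {
        recoveringUntil.put(shardId, System.currentTimeMillis() + WINDOW_MILLIS);
    }

    // Called before sending a command: wait out the recovery window if needed.
    void awaitIfRecovering(int shardId) throws InterruptedException {
        Long until = recoveringUntil.get(shardId);
        if (until != null) {
            long wait = until - System.currentTimeMillis();
            if (wait > 0) Thread.sleep(wait);
            recoveringUntil.remove(shardId, until);
        }
    }
}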
Resizing the In-house Redis Cluster
• Sharding rule: Consistent hashing
• Supports dynamic resizing; immutable & cached data only
• Hard to balance data distribution
• Sharding rule: Fixed size ring
• No dynamic resizing, migration only
• Easy to implement & easy to balance
• Migrate data to a new cluster online, in the background
• Migration takes several days when the data set is large
[Diagram: the application keeps serving while data migrates in the background from the old cluster to the new cluster]
Reliability of Redis as Primary Storage
• What if?
• What if both the master and slave of a shard go down at the same time?
• What if a data-center power loss forces a reboot of all Redis servers?
• RDB or AOF?
• Persistence affects Redis' average response time
• Persistent storage?
• Adopted HBase
• Read/write important data from/to both Redis and HBase
Dual Write and Read HA (High Availability)
• Dual write
1. Write data to Redis first
2. Write data to HBase in the background
• With another thread or using Kafka
• Read HA (sketched after the diagram)
1. Send a read request to Redis first
2. Wait a few hundred microseconds for the response. If no Redis response arrives (a rare case),
3. Send the read request to HBase concurrently
4. Use whichever response returns first, regardless of the sender (usually the Redis response comes first)
[Diagrams: dual write — the application writes to Redis, then to HBase in the background (via a thread or Kafka and the asynchronous task processor); Read HA — the application races Redis against HBase]
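A minimal sketch of the Read HA race, assuming CompletableFuture-based storage reads; the 500 μs grace period and the supplier names are illustrative assumptions.

// Illustrative Read HA: use the Redis response if it arrives quickly;
// otherwise also ask HBase and take whichever response comes back first.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class ReadHa {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    CompletableFuture<String> read(CompletableFuture<String> redisRead,
                                   Supplier<CompletableFuture<String>> hbaseRead) {
        CompletableFuture<String> result = new CompletableFuture<>();
        redisRead.thenAccept(result::complete);          // usually wins
        scheduler.schedule(() -> {
            if (!result.isDone()) {                      // rare case: Redis is silent
                hbaseRead.get().thenAccept(result::complete);
            }
        }, 500, TimeUnit.MICROSECONDS);                  // grace period (assumption)
        return result;
    }
}

Completing an already-completed CompletableFuture is a no-op, so whichever response arrives second is simply discarded.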
Hot Key
• A hot key results in:
• Command bursting, connection bursting
• Slowlogs
• Slower response times for applications
• How to avoid it?
• Write-intensive: re-design the key
• Read-intensive: use multiple slaves or multiple clusters
[Chart: a command-bursting case — count of commands per second per process]
Replicated Cluster
• To increase read scalability
• Used for specific purposes only
• Cache cluster
• Immutable data
• Long-lived data
• The client chooses a cluster at random (sketched after the diagram)
• Uses the origin storage for failover
• Warming up a new cluster takes several days
[Diagram: the application picks one of Cluster-1…N at random and falls back to the origin storage on failure]
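A minimal sketch of reading from replicated cache clusters, assuming Jedis entry points and a hypothetical OriginStorage interface for the fallback.

// Illustrative replicated-cluster read: pick a cluster at random,
// fall back to the origin storage (e.g. HBase) on a miss or failure.
import redis.clients.jedis.Jedis;

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

interface OriginStorage {                // hypothetical origin-storage interface
    String get(String key);
}

class ReplicatedCaches {
    private final List<Jedis> clusters;  // one entry point per replica cluster
    private final OriginStorage origin;

    ReplicatedCaches(List<Jedis> clusters, OriginStorage origin) {
        this.clusters = clusters;
        this.origin = origin;
    }

    String get(String key) {
        Jedis cluster = clusters.get(ThreadLocalRandom.current().nextInt(clusters.size()));
        try {
            String value = cluster.get(key);
            if (value != null) return value;
        } catch (Exception e) {
            // cluster unavailable; fall through to the origin
        }
        return origin.get(key);          // fallback
    }
}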
Timeout
• Sometimes a Redis shard or machine slows down or crashes
• A single bad Redis command on a big collection
• Command bursting caused by hot keys
• Various hardware failures (e.g. ECC memory errors)
• Waiting and timeouts
• Waiting on millions of Redis responses can trigger an outage (busy threads or a full request queue)
• A short timeout is important (sketched after the diagram)
• Is a timeout enough?
[Diagram: a busy Redis shard ("I'm busy!!") with three applications stuck waiting ("I'm waiting!")]
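A minimal sketch of a short client-side timeout, assuming Jedis; the 100 ms value is an illustrative assumption.

// Illustrative short client-side timeout with Jedis.
import redis.clients.jedis.Jedis;

class TimeoutExample {
    public static void main(String[] args) {
        // Third argument: connect/socket timeout in milliseconds — fail fast
        // instead of tying a request thread to a slow shard.
        try (Jedis jedis = new Jedis("127.0.0.1", 6379, 100)) {
            System.out.println(jedis.get("some-key"));
        }
    }
}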
Circuit Breaker for Fast Failure
• Adopted a circuit breaker named Anticipator for "important" clusters (a minimal sketch follows the diagram)
• Aims to predict failures and to avoid bothering Redis servers when they are busy
• When response time increases, it temporarily marks the target shard as failed
• Applications fail such requests immediately, without sending them to the shard
• The shard returns to the normal state after a short period and a successful test
[Diagram: the Anticipator tells applications, "You seem busy. We are not sending any Redis request to you for now."]
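A minimal circuit-breaker sketch in the spirit of the Anticipator, assuming a latency-EWMA trigger and a fixed cooldown; thresholds and structure are illustrative, not LINE's implementation.

// Illustrative circuit breaker: when the latency EWMA crosses a limit,
// fail fast for a cooldown period instead of sending requests.
import java.util.concurrent.atomic.AtomicLong;

class ShardBreaker {
    private static final long LATENCY_LIMIT_MICROS = 2_000;  // assumption
    private static final long COOLDOWN_MILLIS = 3_000;       // assumption

    private final AtomicLong ewmaMicros = new AtomicLong(0);
    private volatile long openedAt = 0;                       // 0 = closed

    // Record each observed response time as an exponential moving average.
    void record(long micros) {
        long prev = ewmaMicros.get();
        long next = prev == 0 ? micros : (prev * 7 + micros) / 8;
        ewmaMicros.set(next);
        if (next > LATENCY_LIMIT_MICROS) openedAt = System.currentTimeMillis();
    }

    // True while requests to this shard should fail fast.
    boolean isOpen() {
        return openedAt != 0
                && System.currentTimeMillis() - openedAt < COOLDOWN_MILLIS;
    }
}

A caller checks isOpen() before each request and fails immediately while the breaker is open; after the cooldown, traffic resumes and record() keeps updating the average.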
Redis Cluster Monitor: a Scalable Monitoring System
• Redis Cluster Monitor, an in-house monitoring system
• Gathers metrics with one-second precision
• Scala, Akka, Elasticsearch, Kibana, Grafana
• A resilient and scalable Akka cluster
• Monitors 10,000+ Redis instances
• Sends INFO to all nodes every second
• To smooth network bandwidth, each node's INFO request is staggered so responses spread across the one-second window (sketched after the diagram)
• Aggregated information is viewed on Grafana
[Diagram: INFO requests staggered per node within each 1-second window]
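A minimal Java sketch of the staggering idea (the production system is Scala/Akka): each node's INFO poll keeps a one-second period but starts at a per-node offset, so requests and responses spread evenly across the window. Names are illustrative.

// Illustrative staggered INFO polling (the production system is Scala/Akka).
import redis.clients.jedis.Jedis;

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class InfoPoller {
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);

    void start(List<Jedis> nodes) {
        long stepMicros = 1_000_000L / nodes.size();     // even per-node offset
        for (int i = 0; i < nodes.size(); i++) {
            Jedis node = nodes.get(i);
            pool.scheduleAtFixedRate(
                    () -> store(node.info()),            // INFO once per second
                    i * stepMicros,                      // staggered start
                    1_000_000L,                          // 1-second period
                    TimeUnit.MICROSECONDS);
        }
    }

    private void store(String info) {
        // parse and ship the metrics (e.g. to Elasticsearch)
    }
}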
Automatic Bursting Detection
• To find command/connection bursting
• When "Redis Cluster Monitor" finds bursting patterns
• Captures the associated Redis commands
• Stores the commands in Elasticsearch
• Command, key, parameter, host:port, client's IP, count, and more
• Developers view the information on Kibana, find the problematic commands, and fix the cause
[Screenshot: detected bursting results in Kibana]
Experiences with the Official Redis Cluster 3.2
• Used Redis Cluster 3.2 (not 4.x)
• Why “official” Redis Cluster?
• Dynamic resizing with mutable data via server-side clustering
• Community standard (right?)
• So we applied it to some clusters
• And we ran into some issues
Redis Cluster 3.2: Replication and Operation
• When replacing a "master" machine (e.g. memory ECC warning, disk failure):
• Killing a master → client failures continue for 20 seconds to 1 minute
• Manual failover via the CLUSTER FAILOVER command
• PSYNC v1 → full sync → some client failures
• Workaround
• Move slots to other masters → FAILOVER → move the slots back
• Takes a long time and causes a lot of rehashing
[Diagram: the master and slave of a shard (A, B) swap roles via CLUSTER FAILOVER]
Redis Cluster 3.2: More Used Memory
• Needs more memory than standalone Redis
• Keeps the key-to-slot mapping in a ZSET
• https://github.com/antirez/redis/issues/3800
• 2× memory → requires 2× machines?
• 10,000+ shards → 20,000+ shards?
• 4.x uses a radix tree (rax) instead, but still uses more memory than standalone
Type                    Version   Used Memory (GB)
In-house (standalone)   3.2.11    5.91
Redis Cluster           3.2.11    11.69
In-house (standalone)   4.0.8     5.91
Redis Cluster           4.0.8     9.43
Example: a Redis server process with 56M keys (jemalloc-4.0.3), maxmemory 16 GB
Redis Cluster 3.2: Max Nodes Size
• “High performance and linear scalability up to 1,000 nodes.”
• The recommended maximum cluster size is ≤ 1,000 nodes
• Some of our large clusters already have 1,000–3,000 nodes (shards)
• Beyond 1,000 nodes, gossip traffic eats a lot of network bandwidth
• Solutions
• Separating data to another cluster
• Client-side cluster sharding
Pros/Cons of Each System
                         Strength                         Weakness
Proxy                    Light client                     Longer latency; requires more servers for proxies
Client sharding          Short latency; cost efficiency   Resizing is hard; fat client
Official Redis Cluster   Dynamic resizing is easy         Requires more memory and more servers; limited maximum number of nodes
Asynchronous Redis Client
• Some LINE developers need a Redis client for our in-house Redis clusters in their asynchronous application servers built with RxJava and Armeria
• But Jedis 2.x doesn't support asynchronous I/O
• Committed netty-codec-redis (a codec only, not a client) to the Netty repo
• Adopted Lettuce for the LINE Redis client (a minimal usage sketch follows the diagram)
• https://github.com/lettuce-io/lettuce-core
• Added implementations for in-house requirements: client sharding, per-request timeout checks, monitoring metrics, the Anticipator, replicated clusters, external reconnection, etc.
• Committed fixes and improvements to the Lettuce repo
[Diagram: synchronous I/O blocks the calling thread while waiting per request; asynchronous I/O keeps it free]
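A minimal usage sketch of Lettuce's asynchronous API, assuming the 4.x package layout (com.lambdaworks.redis; Lettuce 5+ moved to io.lettuce.core) and a local Redis.

// Illustrative asynchronous GET with Lettuce 4.x.
import com.lambdaworks.redis.RedisClient;
import com.lambdaworks.redis.RedisFuture;
import com.lambdaworks.redis.api.StatefulRedisConnection;
import com.lambdaworks.redis.api.async.RedisAsyncCommands;

import java.util.concurrent.TimeUnit;

public class AsyncGetExample {
    public static void main(String[] args) throws Exception {
        RedisClient client = RedisClient.create("redis://127.0.0.1:6379");
        StatefulRedisConnection<String, String> conn = client.connect();
        RedisAsyncCommands<String, String> async = conn.async();

        // Returns immediately; the callback fires when the response arrives.
        RedisFuture<String> future = async.get("user:123");
        future.thenAccept(value -> System.out.println("value = " + value));

        // The calling thread is free to issue more requests here.

        future.await(1, TimeUnit.SECONDS);  // block only for this demo
        conn.close();
        client.shutdown();
    }
}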
Latency of Asynchronous Redis Client (1)
• Redis server processes are fast
• Client-side latency can be a big part of the whole latency
• Average latency for a single request on an idle Linux machine
• Jedis: 16 μs
• Lettuce (Netty + epoll): 31 μs
• Differences?
• Jedis 2.9: sendto(), poll(), recvfrom() in the same thread
• Lettuce 4.4.x: write(), epoll_wait(), read(), plus more futex() calls
• ByteBuf and Netty's buffer management, JNI (epoll), …
[Chart: per-request latency breakdown in nanoseconds — Jedis 2.9 vs Lettuce 4.4.x with netty-codec-redis]
Latency of Asynchronous Redis Client (2)
• How about in production?
• The differences depend on JVM state, threads, GC, and other factors
• Lettuce shows a longer average response time
• And more response-time peaks
[Charts: average client-side response time (μs) and client-side latency peaks (μs), Jedis vs Lettuce]
Throughput of Asynchronous Redis Client
• Good throughput
• Pros/Cons
• Sync: short latency / thread blocking
• Async: high throughput / longer latency
• Adopting Lettuce for asynchronous server modules
[Chart: throughput (requests per second) for Jedis vs Lettuce at increasing concurrency (1–12)]
Current Challenges and Future Work
• Migrating more in-house clusters to official Redis Cluster 3.2
• Improving the cluster management system for both in-house and official clusters
• Adopting Lettuce-based clients more and more for asynchronous systems
• Testing and trying Redis Cluster 4.x
• And more: reducing hot keys, automating operations, reducing storage clusters, reducing mutable data, and more
We're Hiring
• https://career.linecorp.com/linecorp/career/list
• https://linecorp.com/ja/career/ja/all
• http://recruit.linepluscorp.com/lineplus/career/list
Thank You
