©2015 Azul Systems, Inc.	 	 	 	 	 	
How NOT to
Measure Latency
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations/latency-response-time
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
The “Oh S@%#!” talk
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
About me: Gil Tene
co-founder, CTO @Azul Systems

Have been working on
“think different” GC
approaches since 2002

A long history building
Virtual & Physical
Machines, Operating
Systems, Enterprise apps,
etc...

I also depress people by
pulling the wool up from
over their eyes…
* working on real-world trash compaction issues, circa 2004
Latency Behavior
Latency: The time it took one operation to happen

Each operation occurrence has its own latency

What we care about is how latency behaves

Behavior is a lot more than “the common case was X”
We like to look at pretty charts…
95%’lie
The “We only want to show good things” chart
A real world, real time example
So this is a better picture. Right?
Why do we tend to avoid plotting Max latency?
Because no other %’ile will be visible on the same chart…
I like to rant about latency…
#LatencyTipOfTheDay:
If you are not measuring and/or
plotting Max, what are you hiding
(from)?
What (TF) does the Average
of the 95%’lie mean?
What (TF) does the Average
of the 95%’lie mean?
Let’s do the same with the 100%’ile. Suppose we have a set of
100%’ile values, one for each minute:

[1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2]

“The average 100%’ile over the past 15 minutes was 42”

Same nonsense applies to any other %’lie
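The slide’s arithmetic is easy to check; a quick sketch:

```python
# Per-minute 100%'ile (i.e. max) latencies from the slide, in msec:
per_minute_max = [1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2]

# "Averaging" the per-minute percentiles yields a meaningless 42...
average_of_maxes = sum(per_minute_max) / len(per_minute_max)
print(average_of_maxes)  # 42.0

# ...while the actual 100%'ile over the full 15 minutes is 601.
print(max(per_minute_max))  # 601
```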
#LatencyTipOfTheDay:
You can't average percentiles.
Period.
Percentiles Matter
Is the 99%’lie “rare”?
99%’lie: a good indicator, right?
What are the chances of a single web page
view experiencing >99%’lie latency of:
- A single search engine node?
- A single Key/Value store node?
- A single Database node?
- A single CDN request?
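Assuming the per-request latencies are independent samples (a simplifying assumption), the chance that a page view touching N such requests avoids the tail entirely shrinks fast:

```python
def chance_of_seeing_worse_than(percentile, requests):
    """Chance that at least one of `requests` independent samples
    lands above the given percentile (e.g. 99 for the 99%'ile)."""
    return 1 - (percentile / 100) ** requests

# A page view composed of 100 server responses:
print(chance_of_seeing_worse_than(99, 100))  # ~0.63: most page loads
                                             # experience the 99%'lie
```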
#LatencyTipOfTheDay:
MOST page loads will experience
the 99%'lie server response
Which HTTP response time metric is more
“representative” of user experience?
The 95%’lie or the 99.9%’lie
Gauging user experience
Example: If a typical user session involves 5 page
loads, averaging 40 resources per page.
- How many of our users will NOT experience
something worse than the 95%’lie of http requests?
Answer: ~0.003%
- How many of our users will experience at least one
response that is longer than the 99.9%’lie?
Answer: ~18%
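Both answers follow directly from the session size (5 x 40 = 200 requests), again assuming independent samples; a quick check:

```python
requests = 5 * 40  # 5 page loads averaging 40 resources each

# Users whose ENTIRE session stays at or below the 95%'lie:
all_under_95 = 0.95 ** requests
print(all_under_95)  # ~3.5e-05, i.e. ~0.003% of users

# Users who see at least one response above the 99.9%'lie:
at_least_one_over_999 = 1 - 0.999 ** requests
print(at_least_one_over_999)  # ~0.18, i.e. ~18% of users
```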
Gauging user experience
Example: If a typical user session involves 5 page
loads, averaging 40 resources per page.
- What http response percentile will be experienced
by the 95%’ile of users?
Answer: ~99.97%
- What http response percentile will be experienced
by the 99%’ile of users?
Answer: ~99.995%
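These answers invert the same math: the worst response a session sees is the max of 200 samples, so the percentile p bounding a fraction f of sessions satisfies p^200 = f:

```python
requests = 5 * 40  # 200 requests per session, as above

# Percentile experienced by the 95%'ile of users:
p95_user = 0.95 ** (1 / requests)
print(p95_user)  # ~0.99974 -> the ~99.97%'ile

# Percentile experienced by the 99%'ile of users:
p99_user = 0.99 ** (1 / requests)
print(p99_user)  # ~0.99995 -> the ~99.995%'ile
```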
#LatencyTipOfTheDay:
Median Server Response Time:
The number that 99.9999999999%
of page views can be worse than
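The 99.9999999999% figure corresponds to a single page load of 40 resources: the chance that all 40 responses sit at or below the median is 2^-40:

```python
resources = 40  # resources in one page load
all_at_or_below_median = 0.5 ** resources      # ~9.1e-13
worse_than_median = 1 - all_at_or_below_median
print(worse_than_median)  # ~0.999999999999: virtually every page view
                          # is worse than the median server response
```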
Why don’t we have response
time or latency stats with
multiple 9s in them???
You can’t average
percentiles…
And you also can’t get an
hour’s 99.999%’lie out of lots
of 10 second interval 99%’lie
reports…
Why don’t we have response
time or latency stats with
multiple 9s in them???
You can’t average percentiles…
And you also can’t get an hour’s
99.999%’lie out of lots
of 10 second interval 99%’lie reports…
Check out HdrHistogram
It lets you have nice things….
Why don’t we have response
time or latency stats with
multiple 9s in them???
[Charts: Hiccup Duration (msec) vs. Elapsed Time (sec), and Hiccups by Percentile Distribution (0% through 99.9999%) plotted against an SLA]
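The underlying reason HdrHistogram helps: per-interval percentiles cannot be combined, but per-interval histograms can be merged and then queried at any percentile. HdrHistogram (hdrhistogram.org) does this compactly with bounded relative error; the fixed-bucket toy below only illustrates the idea and is not HdrHistogram’s actual layout or API:

```python
from collections import Counter

def value_at_percentile(hist, pct):
    """Smallest bucket at or below which `pct` percent of samples fall."""
    target = sum(hist.values()) * pct / 100
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= target:
            return bucket

# Two 10-second intervals of 1-msec-bucket counts (toy resolution),
# one containing a single 500 msec outlier:
interval1 = Counter({1: 9_999, 500: 1})
interval2 = Counter({1: 10_000})

# You cannot combine the two intervals' 99%'iles, but you CAN merge
# the histograms and then ask the merged data for any percentile:
combined = interval1 + interval2
print(value_at_percentile(combined, 50))      # 1
print(value_at_percentile(combined, 99.999))  # 500
```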
Latency “wishful thinking”
We know how to compute
averages & std. deviation, etc.

Wouldn’t it be nice if latency
had a normal distribution?

The average, 90%’lie, 99%’lie,
std. deviation, etc. can give us
a “feel” for the rest of the
distribution, right?

If 99% of the stuff behaves
well, how bad can the rest be,
really?
The real world: latency distribution
[Chart: Latency by Percentile Distribution, 0% to 99%, latency 0 to 0.5 msec]
The real world: latency distribution
[Chart: Latency by Percentile Distribution, 0% to 99.9%, latency 0 to 5 msec]
The real world: latency distribution
[Chart: Latency by Percentile Distribution, 0% to 99.9999%, latency 0 to 60 msec]
Dispelling standard deviation
[Chart: Latency by Percentile Distribution (msec), 0% to 99.9999%, six measured series A through F]
Dispelling standard deviation
[Chart: Latency by Percentile Distribution (msec), 0% to 99.9999%, six measured series A through F]
Mean = 0.06 msec
Std. Deviation (σ) = 0.21 msec
99.999%’ile = 38.66 msec: ~184 σ (!!!) away from the mean
In a normal distribution, the 99.999%’ile falls within 4.5 σ
These are NOT normal distributions
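The “~184 σ” figure is just the slide’s numbers plugged into the usual z-score:

```python
mean = 0.06     # msec, from the slide
sigma = 0.21    # msec
p99999 = 38.66  # measured 99.999%'lie, msec

sigmas_away = (p99999 - mean) / sigma
print(round(sigmas_away))  # ~184; a normal distribution would put
                           # the 99.999%'lie within ~4.5 sigma
```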
The coordinated omission problem
An accidental conspiracy...
The lie in the 99%’lies
The coordinated omission
problem
Common Example A (load testing):

each “client” issues requests at a certain rate

measure/log response time for each request

So what’s wrong with that?

works only if ALL responses fit within interval

implicit “automatic back off” coordination
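A minimal sketch of the two load-generator styles (the `send_request` callback is hypothetical). The naive loop times each request from when it was actually sent, so one slow response silently delays everything behind it and the stall is recorded only once; the corrected loop times from the intended send time on a fixed schedule, so queueing delay is counted:

```python
import time

def naive_load_loop(send_request, rate_per_sec, duration_sec):
    """Coordinated omission: implicit back-off behind slow responses."""
    latencies, interval = [], 1.0 / rate_per_sec
    end = time.time() + duration_sec
    while time.time() < end:
        start = time.time()
        send_request()
        latencies.append(time.time() - start)
        time.sleep(interval)  # the next request waits for this one
    return latencies

def corrected_load_loop(send_request, rate_per_sec, duration_sec):
    """Measure from the INTENDED send time on a fixed schedule."""
    latencies, interval = [], 1.0 / rate_per_sec
    t0 = time.time()
    for i in range(int(rate_per_sec * duration_sec)):
        intended = t0 + i * interval
        delay = intended - time.time()
        if delay > 0:
            time.sleep(delay)
        send_request()
        # Includes any time the request spent waiting its turn:
        latencies.append(time.time() - intended)
    return latencies
```

With a responsive backend the two loops report the same numbers; insert a single long stall and the naive loop’s percentiles barely move, while the corrected loop’s tail explodes.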
Common Example B: 

Coordinated Omission in Monitoring Code
Long operations only get measured once
delays outside of timing window do not get measured at all
How bad can this get?
[Scenario: a system easily handles 100 requests/sec, responding to each in 1 msec, then stalls for 100 sec]
Avg. is 1 msec over the 1st 100 sec; Avg. is 50 sec over the next 100 sec
How would you characterize this system?
~50%‘ile is 1 msec; ~75%‘ile is 50 sec; 99.99%‘ile is ~100 sec
Overall Average response time is ~25 sec.
Measurement in practice
[Same scenario: 100 requests/sec at 1 msec each, then a 100 sec stall]
What actually gets measured?
10,000 measurements @ 1 msec each; 1 measurement @ 100 sec
50%‘ile is 1 msec; 75%‘lie is 1 msec (should be ~50 sec); 99.99%‘lie is 1 msec (should be ~100 sec)
Overall Average is 10.9 msec (!!!)
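This whole sequence of slides can be reproduced numerically. Below, the coordinated-omission view records the stall once, while the proper view credits the 10,000 requests that were due during the stall with linearly shrinking waits:

```python
# 100 req/sec for 200 sec; the system stalls for the last 100 sec.
# All values in msec.

# Coordinated omission: 10,000 good results + ONE 100-second result.
measured = [1.0] * 10_000 + [100_000.0]

# Proper: the 10,000 requests due during the stall wait anywhere
# from ~100 sec down to ~10 msec.
proper = [1.0] * 10_000 + [100_000.0 - i * 10.0 for i in range(10_000)]

def pctile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(len(s) * p / 100))]

print(sum(measured) / len(measured))  # ~11 msec "overall average"
print(pctile(measured, 99.99))        # 1.0 -- the stall vanishes
print(pctile(proper, 75))             # ~50,000 msec (~50 sec)
print(pctile(proper, 99.99))          # ~100,000 msec (~100 sec)
```

The averages and percentiles land within rounding of the slides’ figures.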
Proper measurement
[Same scenario]
10,000 results @ 1 msec each; 10,000 results varying linearly from 100 sec down to 10 msec
~50%‘ile is 1 msec; ~75%‘ile is 50 sec; 99.99%‘ile is ~100 sec
Proper measurement vs. Coordinated Omission
[Same scenario]
Proper: 10,000 results @ 1 msec each, plus 10,000 results varying linearly from 100 sec down to 10 msec (~50%‘ile is 1 msec; ~75%‘ile is 50 sec; 99.99%‘ile is ~100 sec)
Coordinated Omission: 1 msec before and after the stall, and just 1 result @ 100 sec for the stall itself
“Better” can look “Worse”
[Scenario: instead of stalling, the system is slowed for 100 sec. It still easily handles 100 requests/sec, but responds to each in 5 msec]
10,000 @ 1 msec; 10,000 @ 5 msec
50%‘ile is 1 msec; 75%‘lie is 2.5 msec; 99.99%‘lie is ~5 msec
(the stalled system shows 1 msec for both)
“Correction”: “Cheating Twice”
[Same stall scenario]
Coordinated Omission: 1 msec before and after the stall, 1 result @ 100 sec, plus a “correction” of 9,999 additional results @ 1 msec each
Proper: 10,000 results @ 1 msec each, plus 10,000 results varying linearly from 100 sec down to 10 msec
~50%‘ile is 1 msec; ~75%‘ile is 50 sec; 99.994%‘ile is ~100 sec
Response Time vs. Service Time
Service Time vs. Response Time
Coordinated Omission
Usually
makes something that you think is a
Response Time metric
only represent
the Service Time component
Response Time vs. Service Time @2K/sec
Response Time vs. Service Time @20K/sec
Response Time vs. Service Time @60K/sec
Response Time vs. Service Time @80K/sec
Response Time vs. Service Time @90K/sec
How “real” people react
Service Time, 90K/s vs 80K/s
Response Time, 90K/s vs 80K/s
Response Time, 90K/s vs 80K/s (to scale)
Latency doesn’t live in a vacuum
Sustainable Throughput:
The throughput achieved while
safely maintaining service levels
Comparing behavior under different throughputs
and/or configurations
[Chart: Duration by Percentile Distribution (msec), 0% to 99.999%, comparing Setups A through F]
Comparing response time or
latency behaviors
System A @90K/s & 85K/s vs.
System B @90K/s & 85K/s
Wrong Place to Look:
They both “suck” at >85K/sec
System A 85K/s vs. System B 85K/s
Looks good, but still
the wrong place to look
System A @40K/s vs. System B @40K/s
More interesting…
What can we do with this?
System A @10K/s vs. System B @40K/s
E.g. if “99%’ile < 5msec” was a goal:
System B delivers similar 99%’ile and superior
99.9%’ile+ while carrying 4x the throughput
System A @2K/s vs. System B @20K/s
E.g. if “99.9%’ile < 10msec” was a goal:
System B delivers similar 99%’ile and 99.9%’ile
while carrying 10x the throughput
System A @2k thru 80k
System A @2k thru 70k
System B @20k thru 70k
System A & System B @2k thru 70k
System A & System B
10K/s thru 60K/s
System A @ 10K, 20K, 40K, 60K
System B @20K, 40K, 60K
Lots of conclusions can be drawn from the above…
E.g. System B delivers a consistent 100x reduction in the
rate of occurrence of >20msec response times
System A: 200-1400 msec stalls
System B drawn to scale
[Charts: Service Time and Response Time shown for each system, drawn to scale]
This is Your Load on System A
This is Your Load on System B
Any Questions?
A simple visual summary
http://www.azulsystems.com
Any Questions?
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/latency-response-time
