Exposing the Cost of Performance
Hidden in the Cloud
Neil Gunther @DrQz
Performance Dynamics Consulting, Castro Valley, California
Mohit Chawla @a1cy
Independent Systems Engineer, Hamburg, Germany
Performance Dynamics Co.
CMG CLOUDXCHANGE Event
10am Pacific (5pm UTC), June 19, 2018
Video on CMG’s YouTube channel
Exposing the Cost of Performance
Hidden in the Cloud
Neil Gunther and Mohit Chawla
Abstract
Whilst offering lift-and-shift migration and versatile elastic capacity, the cloud also
reintroduces an old mainframe concept — chargeback [1] — which thereby
rejuvenates the need for traditional performance and capacity management in the
new cloud context. Combining production JMX data with an appropriate
performance model, we show how to assess fee-based Amazon AWS configurations
for a mobile-user application running on a Linux-hosted Tomcat cluster. The
performance model also facilitates ongoing cost-benefit analysis of various EC2
Auto Scaling policies.
[1] Chargeback underpins the cloud business model, especially for hot application development, e.g., “Microsoft wants every
developer to be an AI developer, which would help its already booming Azure Cloud business do better still: AI demands data,
which requires cloud processing power and generates bills.” —The Register, May 2018
Outline
1 AWS cloud environment
2 Performance data validation
3 Initial capacity model
4 Improved capacity model
5 Cost of Auto Scaling variants
6 Cloudy economics
AWS cloud environment
Application Cloud Platform
Entire application runs in the Amazon cloud:
[Diagram: mobile Internet users → ELB load balancer → Auto Scaling (A/S) group → AWS EC2 cluster]
Mobile users make requests to the Apache HTTP server [2] via ELB on EC2
Tomcat thread-server [3] on EC2 calls external services (belonging to 3rd parties)
Auto Scaling controls number of EC2 instances
based on incoming traffic and configured A/S
policies
ELB balances incoming traffic across all EC2
nodes in AWS cluster
[2] Apache HTTP Server versions 2.2 and 2.4
[3] Tomcat versions 7 and 8
Request Processing
On a single EC2 instance:
1 Incoming HTTP Request from mobile user processed by Apache + Tomcat
2 Tomcat then sends multiple requests to External Services based on original request
3 External services respond and Tomcat computes business logic based on all those
Responses
4 Tomcat sends the final Response back to originating mobile user
Performance Tools and Scripts
JMX (Java Management Extensions) data from JVM
jmxterm
VisualVM
Java Mission Control
Datadog dd-agent
Datadog — also integrates with AWS CloudWatch metrics
Collectd — Linux performance statistics collection
Graphite and statsd — application metrics collection & storage
Grafana — time-series data plotting
Custom data collection scripts
R statistical libs and RStudio IDE
PDQ performance modeling lib
Performance data validation
Production Data Collection
1 Raw performance metrics:
Performance data primarily collected by Datadog (dd-agent)
Mobile-user requests are analyzed as a single homogeneous workload
JMX provides the GlobalRequestProcessor MBean:
requestCount: total number of requests
processingTime: total processing time for all requests
2 Derived performance metrics:
Convert requestCount to a rate in the Datadog config to get average
throughput Xdat in requests/second
Average request processing time (seconds) is then derived as
Rdat = Δ(processingTime) / Δ(requestCount)
with both deltas taken over the same measurement interval, e.g., T = 300
seconds (a sketch of this derivation follows)
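A minimal sketch of this derivation in R, assuming two hypothetical successive samples of the MBean counters (the numbers are illustrative, chosen to be consistent with the production data shown later):

Tsec <- 300   # sampling interval (seconds)
# Hypothetical cumulative counter samples; Tomcat reports processingTime in ms
prev <- list(requestCount = 1.000e6, processingTime = 9.000e8)
curr <- list(requestCount = 1.150e6, processingTime = 9.510e8)
dReq  <- curr$requestCount   - prev$requestCount     # requests completed in Tsec
dProc <- curr$processingTime - prev$processingTime   # ms of processing accrued
Xdat <- dReq / Tsec              # average throughput: 500 req/s
Rdat <- (dProc / 1000) / dReq    # average processing time: 0.34 s per request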
Concurrency and Service Times
Apply Little’s law to derive additional performance metrics: concurrency (N)
and service time (S) from data
1 Little’s Law — macroscopic version
N = X ∗ R (gives concurrency)
Nest is the calculated or estimated number of concurrent requests in
Tomcat during each measurement interval
Verify correctness by comparing Nest with measured number of
threads Ndat in the service stage of Tomcat
We find Nest ≡ Ndat
2 Little’s Law — microscopic version
U = X ∗ S (gives service time)
Udat is the measured processor utilization reported by dd-agent
(as a decimal fraction, not %)
Already have throughput X reqs/sec from collected JMX data
Estimated service time metric is S = U/X
Reduced EC2 Instance Data
These few metrics will be used to parameterize our capacity model
Timestamp, Xdat, Nest, Sest, Rdat, Udat
1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
1486772700000, 528.587722, 201.187500, 0.000914, 0.366283, 0.483160
1486773000000, 533.439054, 202.600006, 0.000892, 0.378207, 0.476080
1486773300000, 531.708059, 208.187500, 0.000909, 0.392556, 0.483160
1486773600000, 532.693783, 203.266663, 0.000894, 0.379749, 0.476020
1486773900000, 519.748550, 200.937500, 0.000895, 0.381078, 0.465260
...
Unix timestamps (in milliseconds) advance by 300 seconds between rows
Little’s law gives relationships between above metrics:
1 Nest = Xdat ∗ Rdat is macroscopic LL
2 Udat = Xdat ∗ Sest is microscopic LL
3 Time-averaged over T = 300 sec sampling intervals (verified in the sketch below)
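As a quick consistency check, a sketch applying both forms of Little's law to the first data row above:

# Verify Little's law on the first sample row of the reduced data
Xdat <- 502.171674; Rdat <- 0.336740; Udat <- 0.458120
Nest <- Xdat * Rdat   # macroscopic LL: 169.1, close to the logged Nest of 170.27
Sest <- Udat / Xdat   # microscopic LL: 0.000912 s, matching the Sest column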
Initial capacity model
Time Series View
[Figure: July 2016 throughput time series: request rate X(t) vs UTC time (hours)]
Time-Independent View
[Figures: thread-limited throughput (X vs N) and thread-limited latency (R vs N)]
Queueing theory tells us what to expect:
Relationships between metrics, e.g., X and N
The number of in-flight requests is thread-limited, typically to N ≤ 500
Throughput X approaches a saturation ceiling as N → 500 (concave)
Response time R grows linearly, aka the “hockey stick handle” (convex); both shapes are sketched below
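A minimal sketch of these idealized shapes as asymptotic bounds (the Rmin and Nknee values are illustrative placeholders, not fitted parameters): below the knee each request effectively holds its own thread, so X grows linearly while R stays flat; above it, X is capped and Little's law forces R upward linearly.

# Asymptotic bounds for a thread-limited server (illustrative values)
Rmin  <- 0.4    # seconds, minimum response time
Nknee <- 500    # thread limit
Xbound <- function(N) pmin(N, Nknee) / Rmin          # throughput ceiling (req/s)
Rbound <- function(N) pmax(Rmin, N * Rmin / Nknee)   # hockey-stick latency (s)
N <- seq(1, 700, by = 10)
par(mfrow = c(1, 2))   # side-by-side panels
plot(N, Xbound(N), type = "l", xlab = "Concurrent users N", ylab = "X bound (req/s)")
plot(N, Rbound(N), type = "l", xlab = "Concurrent users N", ylab = "R bound (s)")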
Production X vs. N Data – July 2016
[Figure: Production Data July 2016: throughput (req/s) vs concurrent users]
Interpreting X vs. N Data
[Figure: PDQ Model of Production Data July 2016: throughput (req/s) vs concurrent users; Nopt = 174.5367, thrds = 250.00; curves: Data, PDQ]
Interpreting R vs. N Data
[Figure: PDQ Model of Production Data July 2016: response time (s) vs concurrent users; Nopt = 174.5367, thrds = 250.00; curves: Data, PDQ]
Outstanding Questions
PDQ July model looks good visually but ...
Requires ∼ 350 “dummy” queues internally to get correct Rmin
Service time assumed to be CPU time ∼ 1 ms (see later)
What do dummy queues represent in Tomcat server?
Successive polling to external services?
Some kind of hidden parallelism?
October 2016 data breaks the July PDQ model. Why?
Improved capacity model
Production X vs. N Data – October 2016
[Figure: Production data Oct 2016: throughput (req/s) vs concurrent users]
Too much data “clouded” the July 2016 analysis
Interpreting X vs. N Data
[Figure: PDQ Model of Oct 2016 Data: throughput (req/s) vs concurrent users]
Interpreting R vs. N Data
[Figure: PDQ Model of Oct 2016 Data: response time (s) vs concurrent users; curves: Data, PDQ]
Adjusted PDQ Model
library(pdq)

usrmax <- 500
nknee  <- 350
smean  <- 0.4444      # Rmin seconds
srate  <- 1 / smean   # service rate (per second)
arate  <- 2.1         # arrival rate per user
users  <- seq(100, usrmax, 50)
tp     <- NULL        # modeled throughput
rt     <- NULL        # modeled response time
pdqr   <- TRUE        # emit PDQ Report

for (i in 1:length(users)) {
  if (users[i] <= nknee) {
    Arate <- users[i] * arate                     # total arrival rate
    pdq::Init("Tomcat Submodel")
    pdq::CreateOpen("requests", Arate)            # open workload stream
    pdq::CreateMultiNode(users[i], "TCthreads")   # one thread per request
    pdq::SetDemand("TCthreads", "requests", smean)
    pdq::SetWUnit("Reqs")
    pdq::Solve(CANON)
    tp[i] <- pdq::GetThruput(TRANS, "requests")
    rt[i] <- pdq::GetResponse(TRANS, "requests")
  }
}
# ... (above-knee branch, reporting and plotting elided in the original)
Key differences:
Old service time was based on %CPU busy: S = 0.8 ms
Rmin is dominated by time spent inside the external services
New service time is based on Rmin: S = 444.4 ms
Tomcat threads are now parallel service centers in the PDQ model
Analogous to every supermarket customer getting their own checkout lane (a model-vs-data overlay is sketched below)
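To compare the model against the measurements, a sketch of the overlay plot; prod is a hypothetical data frame of production samples with columns N and X (that name is ours, not the deck's):

# Overlay the PDQ curve (users, tp from the loop above) on the measured data
plot(prod$N, prod$X, xlab = "Concurrent users", ylab = "Throughput (req/s)",
     xlim = c(0, usrmax), ylim = c(0, 1000))
lines(users, tp, col = "red")   # PDQ model curve
legend("bottomright", legend = c("Data", "PDQ"),
       pch = c(1, NA), lty = c(NA, 1), col = c("black", "red"))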
Adjusted 2016 PDQ Outputs
[Figures: PDQ Model of Oct 2016 Data: throughput (req/s) vs concurrent users (left) and response time (s) vs concurrent users (right); curves: Data, PDQ]
Auto Scaling knee and pseudo-saturation
[Figure: PDQ Model of Oct 2016 Data: throughput (req/s) vs concurrent users, with a vertical line at the A/S knee]
A/S policy triggered when instance CPU busy > 75%
Induces pseudo-saturation at Nknee = 300 threads (vertical line); see the consistency check below
No additional Tomcat threads invoked above Nknee in this instance
A/S spins up additional new EC2 instances (elastic capacity)
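These knee values are mutually consistent: if every in-flight request holds a thread for Rmin seconds, the per-instance throughput ceiling follows directly. A one-line check:

# Throughput ceiling implied by the Oct 2016 model parameters
Rmin  <- 0.4444         # seconds
Nknee <- 300            # threads at A/S pseudo-saturation
Xknee <- Nknee / Rmin   # = 675.07 req/s, the Oct 2016 Xmax reported later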
Cost of Auto Scaling variants
AWS Scheduled Scaling
A/S policy threshold CPU > 75%
Additional EC2 instances require up to 10 minutes to spin up
Based on the PDQ model, considered pre-emptive, clock-based scheduling of EC2 instances
Cheaper than A/S, but only about 10% savings
Use N service threads to size the number of EC2 instances required for incoming traffic (sizing sketched below)
Removes the expected spikes in latency and traffic (seen in the time series analysis)
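A sketch of the schedule-based sizing rule; the hourly concurrency forecast and headroom factor here are illustrative assumptions:

# Size the scheduled EC2 fleet from forecast concurrency N instead of CPU%
# Each instance pseudo-saturates at Nknee threads (see the PDQ model above)
Nknee    <- 300
headroom <- 0.80   # target at most 80% of the knee per instance
Nfc <- c(150, 120, 100, 90, 110, 180, 260, 350, 450, 520, 560, 540,   # 00-11h
         500, 480, 470, 460, 450, 430, 400, 380, 340, 300, 240, 190)  # 12-23h
instances <- ceiling(Nfc / (headroom * Nknee))   # hourly instance schedule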
AWS Spot Pricing
Spot instances available at up to a 90% discount over On-demand pricing
Challenging to diversify instance types and sizes across the same group, e.g.,
Default instance type is m4.10xlarge
Spot market only has the smaller m4.2xlarge type
Forces manual reconfiguration of the application
Thus CPU%, latency, and traffic are no longer useful metrics for the A/S policy
Instead, use concurrency N as the primary metric in the A/S policy (re-based as sketched below)
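Because the thread knee scales with instance size, a concurrency-based trigger can be re-based when the spot market forces a smaller type. The vCPU counts below are standard AWS figures; linear scaling of the knee with vCPUs is our assumption:

# Re-base the concurrency trigger across instance types, assuming the
# per-instance thread knee scales roughly with vCPU count
vcpu_m4_10xl <- 40    # m4.10xlarge
vcpu_m4_2xl  <- 8     # m4.2xlarge
Nknee_10xl   <- 300   # measured knee on the default type
Nknee_2xl    <- Nknee_10xl * vcpu_m4_2xl / vcpu_m4_10xl   # ~60 threads/instance
fleet_scale  <- vcpu_m4_10xl / vcpu_m4_2xl                # ~5x as many instances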
Cloudy economics
EC2 Instance Pricing
[Diagram: instance capacity lines [4]: reserved instances (lower-risk capex) at the base, on-demand instances above, spot instances (higher-risk capex) filling up to the max capacity line, with “missed revenue?” beyond it; instances vs time]
This is how AWS sees their own infrastructure capacity
[4] J.D. Mills, “Amazon Lambda and the Transition to Cloud 2.0”, SF Bay ACM meetup, May 16, 2018
Updated 2018 PDQ Outputs
[Figures: PDQ Model of Prod Data Mar 2018: throughput (req/sec) vs concurrent users (left) and response time (s) vs concurrent users (right); Rmin = 0.2236, Xknee = 1137.65, Nknee = 254.35]
Performance Evolution 2016 – 2018
[Figures: 2016 daily users and 2018 daily users: user requests N vs UTC time (hours)]
Typical numero uno traffic profile
Increasingly cost-effective performance
Date Rmin (ms) Xmax (RPS) Nknee
Jul 2016 394.1 761.23 350
Oct 2016 444.4 675.07 300
Mar 2018 223.6 1135.96 254
Name of the Game is Chargeback
Google Compute Engine also offers reserved-style (committed-use) and spot-style (preemptible) pricing
Table 1: Google VM per-hour pricing [5]
Machine        vCPUs   RAM (GB)   Price ($/hr)   Preemptible ($/hr)
n1-umem-40     40      938        6.3039         1.3311
n1-umem-80     80      1922       12.6078        2.6622
n1-umem-96     96      1433       10.6740        2.2600
n1-umem-160    160     3844       25.2156        5.3244
Similarly for Microsoft Azure (a worked monthly-cost example follows)
[5] TechCrunch, May 2018
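As a worked example from Table 1 (730 hours per month is a common billing approximation):

# Monthly cost of one n1-umem-40 at the Table 1 per-hour prices
hours    <- 730                      # approximate hours in a month
ondemand <- 6.3039 * hours           # ~$4,602 per month
preempt  <- 1.3311 * hours           # ~$972 per month
savings  <- 1 - preempt / ondemand   # ~79%, if preemption is tolerable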
Microsoft Acquires GitHub (cloud) for $7.5 BB [6]
GitHub Enterprise, on-site or as cloud instances on AWS, Azure, Google or IBM Cloud, costs $21 per user per month
From Twitter:
“Supporting the open source ecosystem is way more important to MS than anything else—the revenue they make from
hosting OSS-based apps on Azure in the future will dwarf their current devtools revenue.”
“[MS] isn’t the same company that [previously] hated on open source, mostly because it’s [now] symbiotic to their hosting
business. They didn’t start supporting open source from altruism!”
[6] NOTE: That’s Bs, as in billions, not Ms
Summary
Cloud services are more about economic benefit for
the hosting company than they are about technological
innovation for the consumer [7]
Old-fashioned mainframe chargeback is back! [8]
It’s incumbent on paying customers to minimize their
own cloud services costs
Meaningful cost-benefit decisions require ongoing
performance analysis and capacity planning
PDQ model presented here is a simple yet insightful
example of cloud sizing and performance tools [9]
Queueing model framework helps expose where
hidden performance costs actually reside
You only have the cloud capacity that you pay for
[7] Not just plug-and-play. More like pay-and-pay!
[8] Chargeback had disappeared with the advent of non-monolithic client-server architectures
[9] The PDQ Workshop is available at a discount to CMG members. Email classes@perfdynamics.com for details.
Questions?
www.perfdynamics.com
Castro Valley, California
Training — including the PDQ Workshop
Blog
Twitter
Facebook
info@perfdynamics.com — any outstanding questions
+1-510-537-5758