Web Performance Wars - Total Performance Consulting
2
3
Have you ever had a performance
related emergency?
4
5
TIME IS MONEY
6
TIME IS MONEY
7
GREATER CONSEQUENCES
War Stories
8
A Park’s Perf War
9
10
PARK RESERVATION PERF WAR
11
WORKLOAD
[Diagram: Business Goals, Requirements, Current Usage, Historical Usage, Projections; outcome: Success]
12
PARK RESERVATION WORKLOAD
HISTORICAL RESERVATIONS COUNTS
• Peak reservations/hr (opening day for 9 parks): 3509
• Peak reservations on opening day: 6950
• Peak reservations/hr (opening day for 5 other parks): 3400
• Peak reservations on 2nd busiest day: 7900
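The hourly peaks understate the real target when most reservations arrive in the first few minutes after opening. Here is a minimal sketch of that arithmetic. The 60% burst share and 10-minute window are illustrative assumptions, not figures from the engagement, and the function name is hypothetical; the 25% growth figure echoes the example growth projection mentioned in the speaker notes.

```python
def peak_per_minute(peak_hour_count, burst_share, burst_minutes, growth=0.0):
    """Estimate the per-minute rate a load test should target.

    peak_hour_count: reservations observed in the busiest hour
    burst_share:     assumed fraction of that hour's volume arriving in the burst
    burst_minutes:   assumed length of the burst window, in minutes
    growth:          assumed growth over the historical figure (0.25 means 25%)
    """
    if not 0 < burst_share <= 1:
        raise ValueError("burst_share must be in (0, 1]")
    if burst_minutes <= 0:
        raise ValueError("burst_minutes must be positive")
    projected = peak_hour_count * (1 + growth)
    return projected * burst_share / burst_minutes


# Flat average over the hour versus a concentrated burst (illustrative inputs).
flat_rate = peak_per_minute(3509, burst_share=1.0, burst_minutes=60)
burst_rate = peak_per_minute(3509, burst_share=0.6, burst_minutes=10, growth=0.25)
print(f"flat: {flat_rate:.1f}/min, burst: {burst_rate:.1f}/min")
```

Under these assumed inputs the burst rate is several times the flat hourly average, which is why a per-minute target is safer to test against than the per-hour number.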
13
WHAT DATA WAS ANALYZED
• Perfmon, PAL, and YSlow
• Key transaction response times
• Resource utilization
• Validation of proper application calls
14
BUSINESS LESSONS LEARNED
• Performance Considerations
• Solid Workload Model
• Peak busy min… High traffic days
15
TECHNICAL LESSONS LEARNED
• Not what it seems to be
• Always suspect the DB
• Disk I/O
• Front End Performance
• Validation of proper application calls
A Retailer’s Perf War
16
17
OVERVIEW
• Large Scale Test
• Quick Ramp to Full Load
• Load Needs to Come from Multiple Geos
18
BATTLE WOUNDS
19
RETAIL LESSONS
• Speak to CDN provider prior to any load test
• Ensure at least a 15-minute ramp-up to full load
• Testing off hours
• Ensure Load distribution is across as many available
regions as possible
• No more than 300 Mb/s traffic should be generated
from one geo area + ISP combination
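The ramp-up and traffic rules in this list can be checked mechanically before a test is scheduled. The sketch below is one possible way to do that, assuming a test plan can be described as a list of load generators. The `LoadGenerator` type, the `check_plan` function, and the ISP names are hypothetical; only the 15-minute and 300 Mb/s limits come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_MBPS_PER_GEO_ISP = 300   # limit from the slide
MIN_RAMP_MINUTES = 15        # minimum ramp-up from the slide


@dataclass
class LoadGenerator:
    geo: str      # city or region of the data center
    isp: str      # upstream provider for that data center
    mbps: float   # traffic this generator is expected to produce


def check_plan(generators, ramp_minutes):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    if ramp_minutes < MIN_RAMP_MINUTES:
        problems.append(f"ramp-up {ramp_minutes} min is under {MIN_RAMP_MINUTES} min")
    # Traffic is budgeted per (geo, ISP) pair, not per generator.
    totals = defaultdict(float)
    for g in generators:
        totals[(g.geo, g.isp)] += g.mbps
    for (geo, isp), mbps in sorted(totals.items()):
        if mbps > MAX_MBPS_PER_GEO_ISP:
            problems.append(
                f"{geo}/{isp} generates {mbps:.0f} Mb/s, over {MAX_MBPS_PER_GEO_ISP} Mb/s"
            )
    return problems


plan = [
    LoadGenerator("San Jose", "ISP-A", 200),
    LoadGenerator("San Jose", "ISP-A", 150),  # same geo + ISP: counted together
    LoadGenerator("San Jose", "ISP-B", 250),  # same geo, different ISP: separate budget
]
for problem in check_plan(plan, ramp_minutes=10):
    print(problem)
```

Grouping by the (geo, ISP) pair rather than by data center reflects the rule as stated: two data centers in the same city on the same ISP share one 300 Mb/s budget.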
A Bank’s Perf War
20
21
THE CHALLENGE
• 1000+ Branches
• New software deployment
22
BANKING BATTLEGROUND
• IBM AIX Environment
• IBM HTTP Server (Apache), WebSphere
Application Server, Oracle 10
• Web client front end (Javascript)
• F5 Load Balancers
23
BANKING STRATEGY
• Examine how traffic was being distributed
to the servers
• Suspected problem with app traffic load
balancing
24
BANKING VICTORY
• Investigated load balancer algorithms
• Made the change – problem solved!
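The slides do not say why the original load balancing left servers unevenly loaded. One common mechanism is that round robin ignores how long each connection stays open, while least connections adapts to it. The toy simulation below illustrates that mechanism only; the workload, the server count, and the `simulate` function are hypothetical, and it does not model an F5 device.

```python
def simulate(policy, n_servers, durations):
    """Assign one connection per tick; return the peak active count per server.

    policy:    "round_robin" or "least_connections"
    durations: how many ticks each arriving connection stays open
    """
    active = [[] for _ in range(n_servers)]   # end ticks of open connections
    peak = [0] * n_servers
    for tick, duration in enumerate(durations):
        # Drop connections that have finished by this tick.
        for s in range(n_servers):
            active[s] = [end for end in active[s] if end > tick]
        if policy == "round_robin":
            target = tick % n_servers
        elif policy == "least_connections":
            # Ties go to the lowest-numbered server.
            target = min(range(n_servers), key=lambda s: len(active[s]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        active[target].append(tick + duration)
        peak[target] = max(peak[target], len(active[target]))
    return peak


# Hypothetical workload: every 4th connection is long-lived, the rest are short.
durations = [40 if i % 4 == 0 else 2 for i in range(400)]
rr = simulate("round_robin", 4, durations)
lc = simulate("least_connections", 4, durations)
print("round robin peaks:      ", rr)
print("least connections peaks:", lc)
```

With this contrived arrival pattern, round robin sends every long-lived connection to the same server, so that server accumulates connections while the others stay nearly idle. Least connections spreads them because it looks at current load rather than arrival order.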
Software performance testing
and its best practices.
25
26
MONITOR ALL RELEVANT COMPONENTS
27
HOW DOES THE DATA FROM THE TOOLS HELP?
28
ALSO
• Make use of collection and analysis tools readily
available for the platform
• Examine performance results data against
platform thresholds
• Document resource utilization issues
• Establish origins of adverse performance events
29
GET THE LOW HANGING FRUIT
Trend Micro Virus Scan
Overall Counter Instance Statistics
30
GET THE LOW HANGING FRUIT
CONDITION: More than 1000 data IO operations (network, disk, or device IO) per second
PROCESS(*) IO DATA OPERATIONS/SEC: SQL1/NTRTScan
• Min: 0
• Avg: 99
• Max: 1,985
• Hourly trend: 29
• Std deviation: 404
• 10% of outliers removed: 1
• 20% of outliers removed: 0
• 30% of outliers removed: 0
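A low average next to a very high maximum, with the average collapsing once outliers are removed, suggests short bursts rather than sustained load. The sketch below shows one way to compute such a summary. It assumes outliers are simply the highest-valued samples, which may differ from how the reporting tool defines them, and the sample values are made up rather than taken from the engagement.

```python
from statistics import mean, pstdev

THRESHOLD = 1000  # IO data operations/sec, from the condition on the slide


def trimmed_mean(samples, fraction):
    """Average after dropping the highest `fraction` of samples (assumed outlier rule)."""
    keep = len(samples) - int(len(samples) * fraction)
    return mean(sorted(samples)[:keep])


def summarize(samples):
    """Summary statistics for one counter instance, plus a threshold flag."""
    return {
        "min": min(samples),
        "avg": mean(samples),
        "max": max(samples),
        "std": pstdev(samples),
        "avg_10pct_trimmed": trimmed_mean(samples, 0.10),
        "exceeded": max(samples) > THRESHOLD,
    }


# Made-up samples: mostly idle with one burst.
samples = [0, 0, 1, 0, 2, 0, 0, 1, 0, 1985]
summary = summarize(samples)
print(summary)
```

In this made-up series a single burst trips the threshold and dominates the plain average, while the trimmed average stays close to zero.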
31
SOFTWARE PERFORMANCE ANTI-PATTERNS
Traffic didn’t crash the Obamacare site alone. Bad coding did too.
• Excessive DB Hits
• Organic Inheritance
• Insufficient DB Expertise
• No Active System Monitoring for Continuous Improvement
• Non-technical PMs
The fastest path to a slowdown requires no action
32
ANTI-PATTERNS CONT’D
• Excessive Object Creation
• One Lane Bridge
• Unmeasured Cache Behavior
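Cache behavior can only be tuned if it is measured. Here is a minimal sketch of a cache that counts its own hits and misses, assuming a simple in-memory dictionary and a loader function standing in for a database call. `MeasuredCache` and `load_park` are hypothetical names used for illustration.

```python
class MeasuredCache:
    """Dictionary-backed cache that counts hits and misses."""

    def __init__(self, loader):
        self._loader = loader   # function called on a miss, e.g. a DB query
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


db_calls = []


def load_park(park_id):
    # Stand-in for an expensive database lookup (hypothetical).
    db_calls.append(park_id)
    return {"id": park_id}


cache = MeasuredCache(load_park)
for park_id in [1, 2, 1, 1, 3, 2]:
    cache.get(park_id)
print(f"hit rate {cache.hit_rate:.0%}, DB calls {len(db_calls)}")
```

Tracking the hit rate alongside the number of loader calls also makes the excessive DB hits anti-pattern visible: a low hit rate means the cache is not actually shielding the database.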
33
ANTI-PATTERNS EXAMPLE
To prioritize the issues for resolution, one possible ordering is
as follows:
1. SQLServer: SQL Errors Errors/sec
2. SQLServer: Access Methods Workfiles Created/sec
3. SQLServer: Locks Lock Timeouts/sec
4. SQLServer: Latches Latch Waits/sec
5. SQLServer: Deprecated Features Usage
34
DOCUMENT CHANGES BETWEEN DELIVERED BUILDS
35
DON’T IGNORE FRONT-END PERFORMANCE
• An underperforming front end can
negate all downstream optimizations
to the end user
• Modernize legacy code
• Simplify functionality
• Handle exceptions
• Sometimes problem(s) exist on client
platform and not application
• 80% of total response time is spent on
the front end
FRONT END ISSUES
Fewer HTTP Requests
Using a CDN
Some content not HTTP compressed
Image optimization opportunities
Minify JS and CSS
Large ASP.NET ViewState
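To gauge how much an uncompressed response would gain from HTTP compression, one rough check is to gzip the body locally and compare sizes. This is only a sketch under that assumption: the sample markup is invented, `compression_savings` is a hypothetical helper, and real savings depend on the content and the server's compression settings.

```python
import gzip


def compression_savings(body: bytes) -> float:
    """Fraction of bytes saved by gzip-compressing a response body."""
    if not body:
        return 0.0
    return 1 - len(gzip.compress(body)) / len(body)


# Invented, highly repetitive markup; real pages will compress less dramatically.
page = b"<div class='row'><span>campsite</span></div>\n" * 200
print(f"gzip would save about {compression_savings(page):.0%}")
```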
Questions?
36
Amit Patel
Total Performance Consulting
@aapatel
apatel@totalperform.com
@TotalPerform
www.totalperform.com
37
Thank You.

Editor's Notes

  • #2: Good evening, everyone. Glad to see so many faces out there. My name is Amit Patel, and I founded Total Performance Consulting; we are a QA and performance engineering company. I have over 15 years of experience in the software testing world. Today I am going to share some performance war stories and lessons learned, and go over anti-patterns we have identified after working on many different software applications.
  • #3: Performance engineering is often overlooked until crisis strikes, but it does not have to be that way. Performance testing can add value on many fronts: it can help improve scalability and responsiveness, reduce hardware costs, and reduce downtime. Performance engineering is more of an art than a science: systems always give clues to performance issues, but without proper tools or experience those clues can go unnoticed until they manifest like a pop-up storm. I like to think of us as the doctors of the web performance world. The key to successful performance testing is being able to extract the hidden issues that normal testing activity will never uncover. Due to the nature of performance, we often find ourselves in the trenches fighting an uphill battle.
  • #4: Raise of hands? Anyone want to share their story?
  • #5: Performance is very important. In many industries time is money, so consistent application performance is a staple of successful business operation. Unfortunately, some organizations place little focus on performance until there is a problem. Data loss and downtime cost enterprises $17 billion in 2014 in Canada, according to the EMC Global Data Protection Index conducted by Vanson Bourne. As you can see here, companies are losing thousands of dollars per hour, or even millions in some industries.
  • #7: Not only does it cost money when you are down; these days it costs you if you are slow! Amazon estimated that over $1.6B in annual sales could be lost if page loads were one second slower. Google has calculated that by slowing its search results by just four tenths of a second it could lose 8 million searches per day, which means fewer ads served. 74% of users leave a website if it doesn’t load on their smartphone after 5 seconds.
  • #8: Imagine not being able to get money from an ATM because of an enterprise software “glitch”. That is not the kind of attention you want. Employee productivity, brand, life and death: people’s lives are at stake. This is why we fight performance wars.
  • #11: Park reservation system in a .NET environment: camping, RVs, cabins, etc. We were brought in at the last minute. Like U2 event tickets. Last year the system did not handle the anticipated load, and a waiting room was implemented to bridge the gap. ----- The principal problem being addressed is a peak load scenario that is very short and very large: in a matter of minutes the system hits its peak volume for the year. Camping season has an opening date for online reservations, much like purchasing event tickets. There is always a rush to reserve the prime campsites at opening; this produces the peak load performance opportunity we were engaged to address. Waiting room software was implemented to bridge the gap last year. The new software release of the reservation system was not handling the anticipated load for this activity, and the customer was not exactly sure why. The subject environment is Windows based and uses IIS and SQL Server: 8 web servers and a 16-core SQL Server. The application is a parks reservation web application; users are looking to reserve anything from campsites, RVs, swimming, and picnic areas to cabins.
  • #12: As with most performance engagements, we had to sit down and understand the business. Business discussions: understanding the transactions that drive application performance. Requirement gathering: isolating the performance-dependent, high-volume, mission-critical functions. Current usage: what is the interaction of features in the overall performance landscape? Historical usage: has that interaction changed over time? Growth projections (e.g. 25%): we had only historical data and nothing on the future, so we worked through this with the client to come up with growth projections. Improper planning can cause issues with testing; without testing like your production workload, there are risks in the strategy.
  • #13: What we were up against. It’s not really an hour we are concerned about; we need to ensure we can process a large number of reservations in a matter of minutes, really in the first 5 to 10 minutes, which is something we helped the client understand. This is the peak load for the whole year, so we had to ensure the application was sized and coded for a few minutes of work a year. Concentrated time frame. Talking to their business and marketing teams: what do they expect? It is a cross-functional conversation. Performance means different things to different groups, e.g. marketing, technical, business.
  • #14: All relevant data was captured via our load testing and performance monitoring tools. Key transaction response times. Resource utilization: memory, CPU, network, disk I/O. Validation of proper application calls (check for deprecated SQL calls). Data was evaluated over several test cycles. Performance Analysis of Logs (PAL) provides a very good deep dive into many different metrics and helps uncover potential performance issues; from there we analyze and come up with a prioritized list of issues.
  • #15: We started late and had limited time to tackle all the performance issues before the go-live date. Considering performance means treating it like code delivery and QA, as another line item in the project plan, meaning we need dedicated resources, an environment, and support from dev, QA, ops, etc. Really understand the transaction mix and usage pattern for the application. We were not looking at hourly numbers but at per-minute numbers, so we had to talk to the business. It is important to include all folks in the conversation. Talk about the workload model and busy
  • #16: Talk about the proper application calls as the story of the engagement (what we looked at). The usual suspects: we didn’t hit the load target and saw high response times, high web CPU, and web queuing with low DB utilization (high IIS queuing and high web CPU but low DB CPU). On the database side, outside of configuration issues, unoptimized and outdated database calls created the bulk of the issues. When things slow down on the DB, work goes from memory to disk and seems to slow everything down; requests couldn’t complete until the calls came back from the DB. The fix can be as simple as an index or optimized queries, or a properly sized DB (hardware is not as common a cause as it used to be). Most projects seem not to employ DB developers, meaning a true DB developer as opposed to a DBA. Everything is set up properly from a functional standpoint, but it has to be set up fast, and teams are limited in what they can do about DB performance. Front-end performance: when the smartphone became popular, the profile of the front end started increasing, with eye candy and a functionally great experience being implemented. Outdated DB calls were a consistent contributor to system activity. SQL errors (what kind?), DB locks, full scans, memory.
  • #17: The next war we found ourselves supporting was preparing for Black Friday for a retail client. We had to consider not just back-end server testing but also multiple geos, ensuring we measured response time for every location. Very quick arrival rate of customers: as you can imagine, people are waiting for the clock to hit midnight or 3 am. Browse, search, checkout, abandonment.
  • #19: We ran into 403 errors and came to find out the CDN was blocking us. The site used a CDN, and the client discovering… Find out the impact if we run performance tests with these links active. CDNs charge by usage, and concentrated traffic can get you shut down. The test had been going for 30 to 45 minutes, with reasonable response times and some high ones, but moving along. Then we started seeing a large number of 403 errors across the board, for all transactions and all domains. Once we did some analysis of the 403 errors and spoke to the CDN provider, it turned out they had started blocking the traffic because it looked like an attack on the site.
  • #20: First of all, know whether a CDN is involved. Such CDN usage should be disclosed; PE should NOT have to uncover this by other means. Testing through CDNs can have a financial impact on the customer if their agreement is based on the amount of traffic generated. So, for example, if Amazon EC2 has a DC in San Jose with L3 as the ISP and another in San Jose that uses Quest, then each can put out up to 300 Mb/s. If both DCs are in San Jose and both are on Quest, then no more than 300 Mb/s should come from the two combined. This continues to apply, especially now that we are seeing larger-scale requirements for testing.
  • #22: Large regional bank with 1000+ branches. New software deployment to over 7000 users. The bank was facing an enterprise outage: the servers were getting overrun with users until each one failed. As you can imagine, this downtime was a huge financial hit to the bank. Enterprise banking application. The rollout of new software encountered an enterprise-wide outage for approximately 7000 users. Servers were being sequentially overrun with users until they failed. Downtime caused a financial hit to the bank.
  • #23: Rich client app, JavaScript based. Infrastructure recently upgraded, ample capacity.
  • #24: Initially we looked at server logs and OPNET traces, tracking connections on each server via the server logs and monitoring the capture traces. We noticed connections were not evenly distributed and started seeing spikes: one server at 400 connections and 50 across the others. Like App Expert from the CA tool (Compuware tool).
  • #25: Their standard configuration for all web apps was round robin (“this is what we do”), and they rolled out with that. Then came increased load, which is where least connections was leveraged. So suspect all and assume none. In previous meetings with the client, we had informed them that the application performed best in the least connections configuration on the F5 load balancer. The client’s network team ignored our advice and configured round robin.
  • #26: Now that we have gone over some war stories and lessons learned, I will dive into some best practices we have accumulated over the years.
  • #27: Active monitoring can minimize detection time for performance events. It provides historical insight into system performance and makes performance trends easier to discover. Make use of RUM and synthetic transaction monitoring where feasible (from production monitoring data for history). Understand your workload model, user flows, and bounce rates. Tools we use?
  • #28: It provided visual insight into performance events as they unfolded. It provided actionable insight into system performance issues in both code and environment. It helped expose potential database performance and stability issues resulting from the use of deprecated calls. It helped establish clear relationships between certain system actions and performance degradation. Along with third-party tools, sometimes we use homegrown monitoring to dive deep into application analysis and get the insight we need. Make use of collection and analysis tools readily available for the platform. Examine performance results data against platform thresholds. Document resource utilization issues. Establish origins of adverse performance events. Analyze impacts of delayed responses on the front end due to slow or unresponsive back-end calls. State it generically and then cite the example: a spike in workfiles created/sec at the DB caused a corresponding spike in CPU on the front end.
  • #29: Many platforms have tools available (either from the vendor or the platform ecosystem) that provide clear insight into system operations against platform thresholds. Every application support… A template file of thresholds for each technology… Categorize what we are seeing (establish origins) and pinpoint from the data the source that was the instigator. Correlation of workfiles to web CPU. Front-end code or beyond the front end (back-end issues surfacing at the front end).
  • #30: Most tests produce at least one obvious performance recommendation from the results if performance issues are encountered. Configuration settings, third-party app interaction: look at all of it. Ensure vendor tuning recommendations have been followed. Test results often point out glaring performance issues. Cover the system front to back; opportunities can exist on every tier. Use vendor performance tools to help unmask performance problems. Many platform tools are available for the cost of a download. Sometimes you find a nice nugget of gold without trying hard, just because… Picture of a gold nugget. For example…
  • #31: During the park reservation engagement we had less than 1 month, so we looked for low-cost, high-return fixes. In this case, the Trend Micro virus scan was causing elevated disk I/O during performance test execution. One checkbox eliminated this issue. It seems to occur with large files like those of SQL and VMware, so we disabled the digital signature cache and drastically reduced the disk I/O on the DB. This process is associated with Trend Micro antivirus; here is the fix from Trend Micro: This issue is encountered on computers installed with applications that have large files like SQL and VMware. The workaround solution is to disable the digital signature cache on the affected machine. 1. Open the OfficeScan server's web console. 2. Click Agents > Agent Management. 3. Click Networked Computers > Client Management. 4. Click Privileges and Other Settings for the affected machine/group. 5. Go to the Other Settings tab. 6. Under the Cache Settings for Scan, untick the Enable the digital signature cache check box. 7. Click Save and wait until the changes are applied.
  • #33: Recycle objects rather than creating new ones every time. First, recognize the existence of one-lane bridges; create alternate paths (sequential flows), touchpoint bridge. Periodic review. Measure and understand cache behavior to help avoid performance hits: memory caching, maintenance, using resources properly. There are many other anti-patterns around software performance; these are the ones relevant to the present case. For example…
  • #34: Having an experienced DB developer on staff can go a long way toward keeping bad design and coding practices from stymieing your application. This is based on our anti-patterns of excessive DB hits and lack of proper DB development experience. During our first case study, the park reservation system, we saw all the usual suspects: DB locks, SQL errors, excessive workfile creation. Item #1: this is proposed as the first priority because errors were introduced during the last run, categorized as user errors, info errors, and kill connection errors. These should be addressed before any other issue analysis is undertaken. Item #2: number 2 is on the list because of the possible absence of indexes for queries and the potential for inefficient queries slowing response. Workfiles are created when physical memory is exhausted, spilling activity over to physical disks. This activity is occurring at a much higher rate than would be considered acceptable. Items #3 and #4: these are related to number 2 because they occur at a higher frequency as a result of it. Item #5: number 5 is here because some SQL functions currently used in the existing code have been superseded by newer functions in later versions of SQL Server. Continued use of these functions could produce unwanted results; the code should be updated.
  • #35: Quality release notes are essential to maintaining software integrity and help track the effects of changes on performance. Think of them as the trail markers one leaves on a hike into unfamiliar surroundings. This process can quickly point out sources of performance anomalies. It provides another data point in tracking performance trends and building your own anti-pattern list, and it helps maintain the integrity of the baseline system performance.
  • #36: Oftentimes legacy code is brought forward and merged into a new app simply because it worked, without significant consideration given to the new architectural world it will now have to function in. Know when to switch from investigating the app to investigating the platform. Get involved early in the design process to head off performance issues before they reach production. Picture of a stream with a bottleneck?? An underperforming front end can negate all downstream optimizations for the end user. Modernize legacy code to take advantage of optimizations in the platform and programming language. Simplify functionality where possible. Handle exceptions for all critical function paths. Sometimes the problem(s) exist on the client platform and not in the application. 80% of total response time is spent on the front end.