Scaling for Success:
Lessons from handling peak loads on Azure
Dibran Mulder
Dibran Mulder
CTO / Azure Solutions Architect
Caesar Groep / Cloud Republic
@dibranmulder
Particular Recognized Professional
Co-Host
www.devtalks.nl
@devtalkspodcast
Every January
• > 70% of all primary schools in the Netherlands
take tests on our platform.
• Pre-Covid paper testing was dominant.
• New student tracking platform in use for the first time.
Tuesday 17th of January
8:15 – 8:30 School opening
8:30 – 9:00 Opening by teacher
9:00 – 9:05 Entire country starts taking tests
9:05 – 9:10 Wait for Azure to Scale
9:10 – 10:00 Continue with the test
10:00 – 11:00 Break & Play outside
11:00 – 12:00 Take second test
Scaling for Success: Lessons from handling peak loads on Azure with NServiceBus
Web Traffic in the Morning
CPU percentage correlates to Https traffic
Scaling can’t keep up
What happens when you make the newspaper?
Hi Mr. Manager!
• Your alerts and monitoring will go off.
• Service Care is getting flooded with calls
• Your manager is going to sit next to you
• Trust has been violated.
• Your work is being monitored all the time.
Would you?
Scale up using the Azure portal
despite your Infra as Code Policy.
Scale up using Infra as Code
and do a deployment.
Fix the problem yourself as a
team lead.
Let a team member fix the
problem.
Take ownership and act, we are
going to scale up, now!
Organize a meeting and
discuss the best approach.
App Service Plan Scaling
App Service Plan Scaling
• Take a baseline for the night e.g. 2 instances
• Take a baseline for working hours, e.g. 10
instances
• Aggressive autoscaling: above 60% CPU, increase to
max
• Decrease over time.
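The rule set above can be sketched as a small decision function. This is a minimal illustration in Python, not Azure autoscale configuration: the 2 and 10 instance baselines and the 60% CPU threshold come from the slide, while the maximum instance count, the working hours, and the step-down size are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    night_baseline: int = 2       # from the slide
    day_baseline: int = 10        # from the slide
    cpu_threshold: float = 60.0   # from the slide
    max_instances: int = 30       # assumption: plan maximum
    workday_start: int = 7        # assumption: working hours start (hour of day)
    workday_end: int = 18         # assumption: working hours end (hour of day)
    step_down: int = 1            # assumption: instances removed per evaluation


def baseline(policy: ScalePolicy, hour: int) -> int:
    """Return the minimum instance count for the given hour of the day."""
    if policy.workday_start <= hour < policy.workday_end:
        return policy.day_baseline
    return policy.night_baseline


def target_instances(policy: ScalePolicy, hour: int, current: int, cpu_percent: float) -> int:
    """Jump straight to max above the CPU threshold, otherwise drift down to the baseline."""
    if cpu_percent > policy.cpu_threshold:
        return policy.max_instances
    return max(baseline(policy, hour), current - policy.step_down)
```

Scaling out is a single jump and scaling in is gradual, so a short spike costs a few extra instance-hours but does not leave users waiting for several evaluation cycles.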
App Service Plan Scaling
• Scaling rules with a 5-minute evaluation time are too slow in certain use cases.
• It’s better to scale aggressively and decrease over time; it won’t hurt costs that badly.
• Pre-provisioning might be helpful in some cases.
• It’s hard to be cost effective and confident at the same time.
• Be prepared to get shit from your nephews and nieces.
• Haven’t we tested right?
• We have load tested the system with a ramp up test up to 5k concurrent users.
• We have tested against non-functional requirements based on pre-Covid usage.
• We haven’t tested with 150k real users hitting the system in a 5-minute window
• We didn’t expect a paradigm shift in the adoption of digital testing.
Lessons learned
What is the next weakest link?
Diagram: identity landscape
Applications authenticate against authenticatie.x.nl
Identity sources:
• Social Logins (Google, Facebook, Twitter)
• Industry standards (Basispoort, Entree)
• Azure Active Directory (Internal employees)
• Custom Identity Providers (Startwoord, Portal)
• Federated AAD (Partners)
• Azure B2C
OpenID Connect: ID Token & Access Token, Refresh Token
APIs
IdentityServer: Client configuration, SAML / OpenID Connect, Persisted Grant Storage (Refresh Token)
Identity Server Persisted Grants
• Refresh tokens are ≈ 515 bytes
• 900 sec lifetime
• 15 days lineage
• 150k students / 2 hours per day ≈ 10 refresh tokens per student
• 10k teachers / 8 hours per day ≈ 40 refresh tokens per teacher
• Students ≈ 1.5 GB per day
• Teachers ≈ 600 MB per day
• Database growth of almost 2.5 GB per day
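The estimate above is simple multiplication, and it helps to be able to redo it with your own numbers. A minimal Python sketch: the token size, lifetime, user counts and tokens per user are taken from the slide, while `clients_per_user` is a hypothetical multiplier for front ends that each keep their own refresh token chain. The sketch does not claim to reproduce the rounded totals on the slide.

```python
TOKEN_BYTES = 515        # from the slide
TOKEN_LIFETIME_S = 900   # from the slide


def tokens_per_day(hours_active: int) -> int:
    """Refresh tokens issued per day when a token is renewed every lifetime."""
    return hours_active * 3600 // TOKEN_LIFETIME_S


def daily_growth_bytes(users: int, tokens_per_user: int, clients_per_user: int = 1) -> int:
    """Bytes of persisted grants added per day for one user group."""
    return users * tokens_per_user * clients_per_user * TOKEN_BYTES


if __name__ == "__main__":
    # Exact renewals per day; the slide rounds these up to 10 and 40.
    print(tokens_per_day(2), tokens_per_day(8))
    # One token chain per user, using the slide's rounded token counts.
    print(daily_growth_bytes(150_000, 10) / 1e9, "GB per day for students")
    print(daily_growth_bytes(10_000, 40) / 1e9, "GB per day for teachers")
```

Because tokens are kept for the 15-day lineage window, the daily figure has to be multiplied by 15 before any cleanup frees space.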
DTU Load on our Identity Server database
Identity Server Persisted Grants
• Users made extensive use of the online testing environment for students and the student tracking system for
teachers.
• Our composable front-end architecture tripled the number of refresh tokens generated.
• Refresh tokens are kept in the Persisted Grant Storage to make sure the lineage of tokens is correct and that
tokens are not reused.
• Database grew to 100 GB in roughly 2 weeks.
• Scaling a database from S2 to a higher tier (Sx) takes up to 1 minute per GB.
• Scaling up a database under stress takes significantly longer…
• IdentityServer doesn’t clean up by default but has a TokenCleanup feature.
services.AddIdentityServer()
    .AddOperationalStore(options =>
    {
        // this enables automatic token cleanup. this is optional.
        options.EnableTokenCleanup = true;
        options.TokenCleanupInterval = 3600; // interval in seconds (default is 3600)
    });
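ALTER PROCEDURE [dbo].[PersistedGrantCleanup]
AS
BEGIN
    SET NOCOUNT ON;
    -- Fix the cutoff once so every batch uses the same boundary.
    DECLARE @CurrentDateTime AS datetime2 = GETDATE();
    -- Track deleted rows explicitly; @@ROWCOUNT is overwritten by almost every statement.
    DECLARE @Deleted AS int = 1;
    EXEC sp_autostats 'dbo.PersistedGrants', 'OFF';
    WHILE (@Deleted > 0)
    BEGIN
        -- Small batches keep locks and log usage low.
        DELETE TOP (3000)
        FROM dbo.PersistedGrants
        WHERE Expiration < @CurrentDateTime;
        SET @Deleted = @@ROWCOUNT;
        -- Pause between batches to throttle the load on the database.
        WAITFOR DELAY '00:00:05';
    END
    EXEC sp_autostats 'dbo.PersistedGrants', 'ON';
END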
Manual Cleanup
• Exactly 15 days (1,296,000 seconds) after our initial burst of users,
DTU issues started.
• Don’t let IdentityServer clean up tokens, because it uses Entity
Framework.
• Staging and production slots run competing cleanups.
• Create a stored procedure and call it from a simple Logic App or Azure Function.
• Make sure not to stress the database: rate limit / throttle the stored
procedure.
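The throttled cleanup can also be sketched outside the database. This is a minimal Python illustration of the batch-and-pause loop, using an in-memory SQLite table as a stand-in for the Azure SQL `PersistedGrants` table. The function name, the trimmed-down table layout, and the injectable `sleep` are assumptions for the example; the batch size of 3000 and the 5-second pause mirror the stored procedure.

```python
import sqlite3
import time


def cleanup_expired(conn, now_iso, batch_size=3000, pause_s=5.0, sleep=time.sleep):
    """Delete expired grants in small batches, pausing between batches.

    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM PersistedGrants WHERE rowid IN ("
            " SELECT rowid FROM PersistedGrants WHERE Expiration < ? LIMIT ?)",
            (now_iso, batch_size),
        )
        conn.commit()  # commit per batch so locks are released quickly
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        sleep(pause_s)  # throttle: give other workloads room between batches


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PersistedGrants (Id INTEGER PRIMARY KEY, Expiration TEXT)")
    conn.executemany(
        "INSERT INTO PersistedGrants VALUES (?, ?)",
        [(i, "2023-01-01T00:00:00") for i in range(10)],
    )
    print(cleanup_expired(conn, "2023-02-01T00:00:00", batch_size=4, pause_s=0.0))
```

Running a single scheduled caller, instead of one cleanup per deployment slot, avoids the competing cleanups mentioned above.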
Persisted Grants Store
• Composable UI architecture can increase the load on your IAM.
• Refresh token lineage is being stored for security reasons.
• IdentityServer is a good product but lacks database maintenance options.
• Scaling up a database can take a significant amount of time.
• Manually altering infra, such as scaling a database, leads to source code issues:
• Should I update this in dev or main or a release branch? Or create a hotfix?
• If I deploy a hotfix will this overwrite my scaling settings?
• Do you have a separate pipeline for Infra as Code?
Lessons learned
Ok, so everyone was able to take a test?
We’re good right?
Microservices in Azure
Hosting: .NET, Java, JavaScript, Python; IaaS, PaaS, FaaS; Azure Functions
Communication: REST, gRPC, Messaging; Azure Service Bus + NServiceBus
Data: SQL, Cosmos DB, Redis; Azure SQL
Https
Point-to-Point Pub / Sub
Single receiver Multiple receivers
Synchronous
Asynchronous
Messaging
• REST is the de-facto standard for communication.
• Is suitable for one-to-one communication.
• Lots of libraries and programming languages
support it. Truly technology agnostic.
• Doesn’t support guaranteed delivery.
• Messaging is the better alternative
• Asynchronous in nature, enables recoverability,
resilience.
• Point-to-Point communication for one-to-one
communication.
• Enables one-to-many communication with pub/sub
patterns.
NServiceBus
NServiceBus: Transactional consistency
• In an event-driven architecture always incorporate
transactional consistency.
• The transaction scopes of several steps are linked to
each other:
1. Handling the incoming message
(StudentChanged)
2. Updating the database
(Database Update)
3. Sending an event
(UserChanged)
• If any of these steps fails, all of them are rolled back.
• NServiceBus has APIs to help with this.
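One common way to get this all-or-nothing behaviour is to write the database update and the outgoing event in one local transaction, and only publish the event afterwards (often called an outbox). The Python sketch below illustrates that idea with SQLite; it is not the NServiceBus API, and the table names, the `fail_before_commit` switch, and the function names are hypothetical.

```python
import json
import sqlite3


def create_schema(conn):
    """Create the business table and the outbox table used by the sketch."""
    conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Name TEXT)")
    conn.execute("CREATE TABLE Outbox (Id INTEGER PRIMARY KEY, Type TEXT, Body TEXT)")
    conn.commit()


def handle_student_changed(conn, message, fail_before_commit=False):
    """Handle StudentChanged: update the database and record UserChanged atomically.

    Returns True when both writes were committed, False when both were rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT OR REPLACE INTO Users (Id, Name) VALUES (?, ?)",
                (message["id"], message["name"]),
            )
            conn.execute(
                "INSERT INTO Outbox (Type, Body) VALUES (?, ?)",
                ("UserChanged", json.dumps(message)),
            )
            if fail_before_commit:
                raise RuntimeError("simulated failure while handling the message")
    except RuntimeError:
        return False  # nothing was written; the incoming message can be retried
    return True
```

A separate dispatcher would read the Outbox table and send the events to the bus, so a failure never leaves a database row without its event or an event without its row.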
Administration
ParnaSys / Dotcom
Authentication
Service
Bus
GET: Teachers/Students
NServiceBus: Sagas
• A workflow consisting of several messages being
handled
• Is started by specific messages
• Handles certain messages
• Somewhat comparable to Azure Durable Functions /
Azure Durable Entities
• State is stored in persistence of choice
• Orchestration is handled via Service bus messages.
• NServiceBus Saga persistence
• SQL Server, MySQL, PostgreSQL, Azure Table Storage,
MongoDB, RavenDB and more.
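The saga idea can be shown without any framework: one message type starts the workflow, other message types advance it, and the state lives in a store between messages. The Python sketch below is an illustration only and not the NServiceBus saga API. The message names, the in-memory store, and the returned status strings are assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ExamSagaState:
    test_id: str
    status: str = "started"
    history: list = field(default_factory=list)


class InMemorySagaStore:
    """Stand-in for saga persistence such as a SQL or table storage backend."""

    def __init__(self):
        self._states = {}

    def get(self, test_id):
        return self._states.get(test_id)

    def save(self, state):
        self._states[state.test_id] = state

    def complete(self, test_id):
        self._states.pop(test_id, None)


STARTED_BY = {"TestStarted"}                  # messages that may create a saga
HANDLED = {"TestPaused", "TestFinished"}      # messages an existing saga reacts to


def handle(store, message_type, test_id):
    """Route one message to the saga instance correlated by test_id."""
    state = store.get(test_id)
    if state is None:
        if message_type not in STARTED_BY:
            return "ignored"  # no saga yet and this message cannot start one
        state = ExamSagaState(test_id, history=[message_type])
        store.save(state)
        return "started"
    if message_type not in HANDLED:
        return "ignored"
    state.history.append(message_type)
    if message_type == "TestFinished":
        store.complete(test_id)  # workflow is done, drop the stored state
        return "completed"
    state.status = "paused"
    store.save(state)
    return state.status
```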
Some (hard) lessons learned on
Event driven Architecture
Application
Application
Service
Bus
Testing environment
Post processing
Reporting
Test processing
• Thousands of events come in from the online testing environment.
• Test started, paused, finished, …
• Microservices act on events
• Notify teachers on test status.
• Close tests when started/finished
• Analyze answers after test, such as:
• d/dt, au/ou, ei/ij-analysis
• Categorical mistakes, fractions, multiplications, etc.
• Historical analysis
• Did the student, class, group, school improve over time?
• Sync data with 3rd Party systems such as LAS
• …
Test processing – LOB systems
Service
Bus
Post processing
Line of Business System:
Test products
REST: Get fault patterns
Fix tests
• People work in LOB (line of business) systems during
business hours.
• Expect data to be locked or incomplete.
• Always validate data on your side of the system.
• Use caching with LOB systems. 99% of the time they are
not built for scale.
• Retry policy of 10 times; the message is dead-lettered
after 10 retries.
• Retrying exposes LOB systems to even more load.
• Back off on functional errors: if the test data isn’t
there, retrying makes no sense.
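The difference between the two kinds of failure can be made explicit in the retry loop. A minimal Python sketch, assuming two hypothetical exception types and an exponential delay between retries; only the limit of 10 retries comes from the slide, and the returned outcome strings are made up for the example.

```python
import time


class TransientError(Exception):
    """Temporary failure such as a timeout or throttling; retrying may help."""


class FunctionalError(Exception):
    """The data is missing or invalid; retrying immediately cannot help."""


def process_with_retries(handler, message, max_retries=10, base_delay_s=1.0, sleep=time.sleep):
    """Run handler(message), retrying only on transient failures.

    Returns "processed", "deferred" (functional error) or "dead-lettered".
    """
    retries = 0
    while True:
        try:
            handler(message)
            return "processed"
        except FunctionalError:
            # Do not hammer the LOB system; park the message for a later attempt.
            return "deferred"
        except TransientError:
            if retries >= max_retries:
                return "dead-lettered"
            sleep(base_delay_s * 2 ** retries)  # assumption: exponential back-off
            retries += 1
```

Deferring functional errors keeps the retry budget for failures that can actually recover, and keeps extra load away from the line of business system.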
PDF Report generation
• For the Doorstroomtoets, PDFs needed to be
generated for students and their parents/guardians.
• The External Service only had a REST API
• We used an Azure Function with Service Bus trigger.
• The external service hosted Puppeteer inside an App
Service.
• 100k reports in 1 afternoon didn’t work well.
• ServicePulse saved us: we retried in batches.
Report Generator
External System
Service
Bus
POST: Generate Reports
Open Chromium Page
Save as PDF
ServicePulse
• Messaging only works well if you design systems well.
• Commands vs Events
• Point-to-point vs Pub Sub
• Service bus topology
• Distinguish between functional and transient exceptions. Don’t retry on functional
exceptions, or back off for a longer period.
• Out-of-order event processing is inevitable at large scale
• Idempotency, replaying
• Azure Service Bus might refuse connections.
• Azure Service Bus Exception : Cannot allocate more handles. The maximum
number of handles is 4999
• Audit logging amplifies the problem.
• Prefer batching over streaming data processing in SQL Server.
• Build for resilience and you will most likely not lose data.
• You can’t live without a Service Bus monitoring solution with thousands of messages
and dozens of services.
• Transactional consistency helps to avoid Zombie records and Ghost messages
Lessons learned
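Out-of-order and replayed events are easier to live with when handlers are idempotent. The Python sketch below shows one way to do that: remember which event ids were processed and ignore events that carry an older version than the stored state. The event shape (`id`, `test_id`, `version`, `status`) is a hypothetical example and not the format used on the platform.

```python
def apply_event(state, seen_ids, event):
    """Apply a test status event at most once and never move state backwards.

    state maps test_id to {"status", "version"}; seen_ids holds processed event ids.
    Returns True when the event changed the state.
    """
    if event["id"] in seen_ids:
        return False  # replayed or duplicated delivery
    seen_ids.add(event["id"])
    current = state.get(event["test_id"])
    if current is not None and current["version"] >= event["version"]:
        return False  # arrived late: a newer event was already applied
    state[event["test_id"]] = {"status": event["status"], "version": event["version"]}
    return True


if __name__ == "__main__":
    state, seen = {}, set()
    finished = {"id": "e2", "test_id": "t1", "version": 2, "status": "finished"}
    started = {"id": "e1", "test_id": "t1", "version": 1, "status": "started"}
    # The "finished" event overtakes "started", then gets replayed.
    for event in (finished, started, finished):
        apply_event(state, seen, event)
    print(state["t1"]["status"])
```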
Regaining trust with Postmortems
Some templates available at:
https://github.com/dastergon/postmortem-templates/blob/master/templates/postmortem-template-azure.md
Title (incident)
Date
Summary of impact
Customer impact
Root cause and mitigation
Next steps
Regaining trust with Postmortems
In the moment:
• Take ownership of the situation. As a DevOps team you must solve the situation.
• Don’t act on emotion; reason with your team.
• As a Team Lead one should:
• Shield your team from stakeholders.
• Don’t fix it on your own. Involve team members.
• Send it yourselves to the corresponding stakeholders.
After the moment:
• Trust has been violated; you must regain it.
• Discuss in the team what went wrong.
• Write a postmortem, be very specific. What was the problem? How did you deal with it? How are you going to prevent
this?
Questions?
cloudrepublic.nl
d.mulder@cloudrepublic.nl
Dibran Mulder
DibranMulder