Inside Nerdio auto-scale: Deep dive!
Auto-scale sounds simple: spin up a few machines when usage increases and shut them down when it drops. But in reality, Nerdio Manager delivers something far more intelligent: a layered, tightly orchestrated, policy-driven auto-scale engine that operates like a well-tuned hybrid between a scheduler, an AI optimizer, and an Azure-native orchestration layer. In this post, we'll dissect Nerdio auto-scale (as best I can).
A native Azure brain
Nerdio Manager runs entirely as a native Azure web application. Its auto-scale engine leverages Azure App Services, a dedicated Azure SQL Database for persistent configuration and history, and optionally a Log Analytics workspace (plus some more). These are all PaaS services btw. Auto-scale decisions trigger real-time ARM API calls to create, start, stop, or delete virtual machines (amongst other actions). Disk tier adjustments, host provisioning, and tagging are part of that too.
The intelligence is in the orchestration. The platform anticipates demand using schedules and historical data, reacts to utilization spikes, and heals broken session hosts — all governed by configurable policies per host pool, including separate/different schedules.
Granular configuration, centralized control
Administrators configure auto-scale behavior per host pool via the Nerdio Manager interface under Host Pools > Auto-scale > Configure, or via Settings > Profiles Management > Auto-scale. Each host pool has its own profile. The following settings (related to today's topic) are stored in the SQL configuration tables:
Capacity definitions (base, minimum, burst)
Scale-out triggers and thresholds
One or more Drain-Mode windows
Scale-in conditions and aggressiveness level
Schedule definition (workdays, start/end hours)
Disk tier configuration for cost optimization
Auto-heal options (restart, recreate, script)
Auto-shrink behavior for personal host pools
Each parameter corresponds to values in the Nerdio database schema, typically under dbo.HostPool, dbo.AutoScaleConfiguration, and dbo.ScheduleProfile.
These values are:
Read during every evaluation cycle
Editable from the Nerdio UI or via Nerdio REST API (/v1/hostpools/{id}/autoscale/configure)
Persisted immediately upon save
When you click “Save” in the auto-scale configuration page, the following happens:
JSON-based payload is submitted to Nerdio’s App Service
It is validated and written to Azure SQL via stored procedures
All future evaluation jobs will reference these settings
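To make the save flow a little more concrete, here is a minimal sketch of validating a configuration payload before it is persisted. The field names, the validation rules, and the `validate` helper are all assumptions made for illustration; they are not Nerdio's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoScaleSettings:
    # Hypothetical field names, chosen only to mirror the settings listed above.
    base_capacity: int
    min_active_hosts: int
    burst_capacity: int
    cpu_scale_out_threshold: float  # percent
    scale_in_aggressiveness: str  # "Low", "Medium", or "High"


def validate(payload: dict) -> AutoScaleSettings:
    """Reject obviously inconsistent values before anything is written."""
    settings = AutoScaleSettings(**payload)
    if settings.min_active_hosts < 0 or settings.burst_capacity < 0:
        raise ValueError("host counts must be non-negative")
    # Assumed rule: the minimum active count cannot exceed the base capacity.
    if settings.min_active_hosts > settings.base_capacity:
        raise ValueError("minimum active hosts cannot exceed base capacity")
    if not 0 < settings.cpu_scale_out_threshold <= 100:
        raise ValueError("CPU threshold must be a percentage in (0, 100]")
    if settings.scale_in_aggressiveness not in {"Low", "Medium", "High"}:
        raise ValueError("unknown aggressiveness level")
    return settings
```

The point of the sketch is only that validation happens before persistence, so every later evaluation cycle can trust what it reads.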
What you can change, where it lives, and how it works
To truly master Auto-Scale, you must understand not just what you can change, but also where each value is stored, how it's used, and how it affects runtime behavior. Here's a breakdown:
All of these are dynamically read during evaluation cycles, either via direct SQL queries or cached values pulled at startup. Audit logs (AutoScaleAudit) reflect which configuration values triggered which actions.
Inside the evaluation cycle
Every ~5 minutes (controlled by an internal app (service) setting, not exposed via the UI or API), Nerdio Manager for Enterprise (NME) executes a background evaluation cycle that analyzes up to 10 dynamic host pools in parallel. Each host pool undergoes a deterministic, logic-driven sequence that governs auto-scaling behavior, host availability, drain scheduling, health verification, and tagging. The evaluation process ensures optimal user experience and cost-efficiency while preventing disruptions and maintaining compliance with custom-defined automation logic.
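As an illustration of the "up to 10 host pools in parallel" idea, the sketch below fans a cycle out over a bounded thread pool. The function names and the placeholder `evaluate_pool` body are hypothetical; how Nerdio actually schedules its background jobs is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

MAX_PARALLEL_POOLS = 10  # concurrency limit quoted in the text above


def evaluate_pool(pool_name: str) -> str:
    # Placeholder for the per-pool sequence (fetch, heal, schedule, scale, log).
    return f"{pool_name}: evaluated"


def run_cycle(pool_names: List[str]) -> List[str]:
    """Evaluate every pool, never running more than 10 at the same time."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_POOLS) as executor:
        # executor.map returns results in the same order as the input list.
        return list(executor.map(evaluate_pool, pool_names))
```

In this sketch an outer timer would call `run_cycle` roughly every five minutes.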
The evaluation sequence consists of the following stages:
1. Configuration & Metrics Fetch
The process begins by retrieving the complete configuration and runtime state:
Configuration data is fetched from SQL tables, including AutoScaleConfiguration (trigger thresholds, schedules, pool behavior), ScheduleProfile (active hours, minimum hosts, rolling drain windows), and GoldenImageReference (for provisioning new VMs).
Runtime metrics are gathered from Azure Monitor (e.g., CPU %, memory %), the AVD Broker (e.g., number and type of sessions), and Azure VM metadata (e.g., tags like AllowNewSession, power state, uptime).
These settings determine what actions (if any) need to occur within this evaluation cycle.
2. Auto-Heal Processing
Before considering scale actions, Nerdio verifies the health of all VMs in the host pool:
Health is assessed via the AVD agent heartbeat, the provisioning success state, and internal retry counters stored in SQL.
Auto-heal behavior is governed by app settings such as AutoScale:RestartAttempts and AutoScale:AutoHealEnabled.
If a VM is unhealthy, Nerdio may attempt to restart it or take corrective action depending on how many restart attempts have been logged.
📌 Auto-heal takes place before any scheduling or trigger evaluation and ensures that only healthy VMs are counted toward active capacity. This prevents unhealthy or unresponsive VMs from skewing load calculations.
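The auto-heal stage can be sketched as a small decision function. This is a simplified model under stated assumptions: the `HostHealth` fields are hypothetical, and falling back to "recreate" once the restart attempts are exhausted is a guess based on the auto-heal options listed earlier (restart, recreate, script), not confirmed behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HostHealth:
    # Hypothetical per-host snapshot assembled from the health signals above.
    name: str
    heartbeat_ok: bool
    provisioning_succeeded: bool
    restart_attempts: int


def is_healthy(host: HostHealth) -> bool:
    return host.heartbeat_ok and host.provisioning_succeeded


def auto_heal_action(host: HostHealth, auto_heal_enabled: bool,
                     max_restart_attempts: int) -> str:
    """Return "none", "restart", or "recreate" for one host."""
    if is_healthy(host) or not auto_heal_enabled:
        return "none"
    if host.restart_attempts < max_restart_attempts:
        return "restart"
    # Assumption: escalate to recreate once restarts are exhausted.
    return "recreate"


def countable_hosts(hosts: List[HostHealth]) -> List[str]:
    """Only healthy hosts count toward active capacity."""
    return [h.name for h in hosts if is_healthy(h)]
```

Running this before any trigger evaluation is what keeps a broken host from being counted as usable capacity.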
3. Schedule & Rolling Drain Window Validation (see further down as well)
Next, Nerdio evaluates whether actions are allowed at the current time, based on UTC (default):
The system reads the ScheduleProfile for the pool and matches the current UTC time against the defined scale windows and drain windows. Each window includes a name, a start time, and one or more of the following: minimum active hosts, pre-stage count, rolling drain percentage, aggressiveness, etc.
Only active windows for the current UTC time are considered; there is no fallback logic to other days or times.
If a Rolling Drain Mode window is active, Nerdio calculates the allowed number of hosts that may be in drain mode (i.e., AllowNewSession=False) as a percentage of the total pool.
📌 Rolling drain logic is initiated here, but applied during the scale-in phase (step 5).
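A rough sketch of the window match and the drain allowance follows. It assumes that a window stays active from its start time until the next window starts on the same UTC day, and that the percentage is rounded down; both are assumptions, as are the `DrainWindow` type and its field names.

```python
import math
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass(frozen=True)
class DrainWindow:
    # Hypothetical shape: the text only mentions a name, a start time, and options.
    name: str
    start_utc: time
    rolling_drain_percent: int


def active_window(windows: List[DrainWindow], now_utc: datetime) -> Optional[DrainWindow]:
    """Pick the most recently started window today; no fallback to other days."""
    current = now_utc.time()
    started = [w for w in windows if w.start_utc <= current]
    return max(started, key=lambda w: w.start_utc, default=None)


def allowed_drain_hosts(total_hosts: int, window: Optional[DrainWindow]) -> int:
    """Number of hosts that may have AllowNewSession=False right now."""
    if window is None:
        return 0
    # Assumption: fractional hosts are rounded down.
    return math.floor(total_hosts * window.rolling_drain_percent / 100)
```

For example, with a 25% window active and 10 hosts in the pool, this sketch allows 2 hosts to be in drain mode.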
4. Scale-Out Trigger Evaluation
If additional hosts may be needed, Nerdio checks for scale-out conditions:
All scale-out triggers are retrieved from the configuration: CPU %, RAM %, average active sessions per host, max sessions, user latency, and more
The logic follows OR-based evaluation: any one trigger exceeding its threshold is sufficient to activate scale-out
If triggered, Nerdio first attempts to start a VM from the pre-provisioned stopped pool. If no stopped hosts are available, new VMs are created using the golden image stored in GoldenImageReference. Nerdio will not exceed the maximum VM count configured for the pool. The minimum active host count from the current schedule is always enforced, even if no triggers are met.
📌 The combination of metrics and thresholds ensures responsive scale-up while protecting against overprovisioning.
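The OR-based trigger check, the minimum floor, and the maximum cap can be expressed in a few lines. The metric names and the "one extra host per cycle" step size are assumptions for illustration only; the real engine may add capacity in different increments.

```python
from typing import Dict, List


def breached_triggers(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """OR logic: any single metric above its threshold counts as a trigger."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0.0) > limit]


def hosts_to_add(running: int, minimum: int, maximum: int,
                 metrics: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """How many hosts to start or create in this cycle."""
    # The schedule's minimum is enforced even when no trigger fires.
    target = max(running, minimum)
    if breached_triggers(metrics, thresholds):
        # Assumption: a fired trigger asks for one additional host per cycle.
        target = max(target, running + 1)
    # Never exceed the pool's configured maximum.
    target = min(target, maximum)
    return max(target - running, 0)
```

Whether the extra hosts are started from the stopped pool or freshly created is a separate decision that follows this calculation.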
5. Scale-In Evaluation & Rolling Drain Enforcement
If usage drops, Nerdio checks whether it can safely scale in host capacity:
Multiple factors are evaluated, including session count (active + disconnected), idle time (since the last session ended), whether AllowNewSession=False is already set, and the aggressiveness setting (Low, Medium, High), which affects how quickly hosts are selected for scale-in.
If scale-in is allowed, Nerdio selects the best candidate host(s) using a ranking logic: lowest number of active sessions, then hosts already in drain mode, then lowest total sessions. It then sets AllowNewSession=False, activating Rolling Drain Mode.
Rolling Drain Window enforcement happens here: the number of VMs being drained is compared to the percentage target from the current schedule, and drain progression is staggered, never affecting more hosts than allowed.
⚠️ Drain mode does not forcibly log off users. It simply prevents new sessions and waits until the host is empty (or timeout hits).
Once a drained host reaches 0 sessions (or meets scale-in policy like minimum active hosts), it is shut down (if it will be re-used later) or deleted (if it is disposable capacity).
6. Auto-Shrink Processing (Personal Desktops Only)
For personal desktop pools (not multi-session), Nerdio evaluates long-term inactivity:
If a VM has not been used for N days (e.g., 30), it is flagged with PendingDeletion=Yes
A background task is queued to delete the VM and clean up resources
📌 This helps reclaim unused personal desktop capacity while minimizing manual admin intervention.
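A minimal sketch of the auto-shrink check, assuming last-use timestamps are available per VM. The function name and input shape are hypothetical; only the PendingDeletion=Yes tag and the 30-day example come from the text above.

```python
from datetime import datetime, timedelta
from typing import Dict


def flag_idle_desktops(last_used: Dict[str, datetime], now: datetime,
                       idle_days: int = 30) -> Dict[str, Dict[str, str]]:
    """Return the tag to apply to each personal desktop idle longer than idle_days."""
    cutoff = now - timedelta(days=idle_days)
    return {
        vm: {"PendingDeletion": "Yes"}
        for vm, used_at in last_used.items()
        if used_at < cutoff
    }
```

Flagging first and deleting in a separate background task, as described above, leaves room to review or cancel a deletion before it happens.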
7. Logging & Tagging
Every scaling decision and action is logged and traceable across multiple systems:
AutoScaleHistory: overview of what happened, per pool, per cycle
AutoScaleAudit: full technical breakdown of why a decision was made (trigger values, threshold match, etc.)
Azure Resource Tags: applied to the VM to record the last scaled in/out time, drain mode status, health status (if applicable), and custom logic flags (RestartedByAutoHeal, PendingDeletion, etc.)
These logs provide full auditability for debugging, support, and compliance.
Rolling Drain Mode
Rolling Drain Mode in Nerdio Auto-Scale gradually removes hosts from service without disrupting active users. Nerdio sets AllowNewSession=False on selected hosts based on a defined percentage in the rolling drain schedule. These hosts remain online until all active sessions end. Drain candidates are ranked (lowest session count first), and multiple hosts can be placed in drain mode concurrently, up to the defined percentage. Once empty, each host is shut down or deleted based on scale-in policy.
Host Selection Logic:
When draining hosts, Nerdio uses this selection order:
Hosts with the lowest number of active sessions
Hosts already in drain mode (carry‑over preference)
Hosts with the lowest total sessions (active + disconnected)
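The selection order above maps naturally onto a multi-key sort. This sketch treats the three criteria as successive tie-breakers, which is an assumption about how they combine; the `SessionHost` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SessionHost:
    # Hypothetical snapshot of one host's session state.
    name: str
    active_sessions: int
    disconnected_sessions: int
    draining: bool


def rank_drain_candidates(hosts: List[SessionHost]) -> List[SessionHost]:
    """Best drain candidate first."""
    return sorted(
        hosts,
        key=lambda h: (
            h.active_sessions,                            # fewest active sessions
            not h.draining,                               # already-draining hosts first
            h.active_sessions + h.disconnected_sessions,  # fewest total sessions
        ),
    )
```

Because `False` sorts before `True`, negating `draining` puts hosts that are already in drain mode ahead of the others when active session counts tie.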
Behavior & Sequence
Across roughly 5‑minute evaluation cycles (part of the above cycle):
Nerdio evaluates the current count of drain‑mode hosts against the configured percentage.
If more hosts need draining, it picks the next eligible host(s) as per the logic above.
It sets AllowNewSession=False on those hosts (Azure AVD drain mode).
Hosts remain available until all active sessions disconnect or a configured minimum number of active hosts is reached.
Only when a drained host has zero sessions will it be considered for scale‑in (shutdown or deletion).
The process repeats until the targeted drain percentage is reached.
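One pass of this loop might look like the sketch below. It is a simplified model: hosts are plain dictionaries with hypothetical keys, the percentage is rounded down, and candidates are ranked by session count only.

```python
import math
from typing import Dict, List


def drain_step(hosts: List[dict], drain_percent: int,
               min_active_hosts: int) -> Dict[str, List[str]]:
    """Decide which hosts to put in drain mode and which are ready for scale-in.

    Each host is a dict with the hypothetical keys "name", "sessions", "draining".
    """
    allowed = math.floor(len(hosts) * drain_percent / 100)
    draining = [h for h in hosts if h["draining"]]
    accepting = [h for h in hosts if not h["draining"]]

    # Free slots under the percentage target for this cycle.
    slots = max(allowed - len(draining), 0)
    # Never let the number of hosts accepting sessions drop below the minimum.
    slots = min(slots, max(len(accepting) - min_active_hosts, 0))

    next_to_drain = sorted(accepting, key=lambda h: h["sessions"])[:slots]
    # Only drained hosts with zero sessions are considered for scale-in.
    ready = [h["name"] for h in draining if h["sessions"] == 0]
    return {"set_drain": [h["name"] for h in next_to_drain], "scale_in": ready}
```

Calling this once per evaluation cycle gives the staggered progression described above: a few more hosts enter drain mode each pass, and empty ones drop out.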
Configuration Options
Windows (drain windows): Each specifies a name and a start time
Load Balancing mode: Choose Depth‑First or Breadth‑First (available only in the Premium edition)
Administrators do not force logoffs; all sessions active on drained hosts remain until users sign out.
Aggressiveness: Controls how assertively hosts are selected for drain (Low, Medium, High).
Where data lives
Azure SQL holds: host pool configuration (HostPool, AutoScaleConfiguration, ScheduleProfile), golden image references (GoldenImageReference), auto-scale history (AutoScaleHistory), and audit logs (AutoScaleAudit)
Log Analytics (if configured): login failures, heartbeat failures, and CPU/RAM metrics
VM Tags (read and written at runtime): LastProcessedByAutoScale, DrainMode=Enabled, and PendingDeletion=Yes
Visualizing behavior with auto-scale Insights
Auto-Scale Insights (Premium) visualizes host pool capacity and usage trends in 30-minute blocks based on the pool's local time zone. For each interval, it shows:
Minimum and maximum available sessions per host (solid vs. drain-mode sessions)
Actual session usage over time
Over-provisioning (more capacity than needed)
Under-provisioning (insufficient capacity for demand)
Number of excess users (unable to connect due to no available sessions)
Admins define a session density value (e.g., 12 users per host) to help correlate performance triggers (CPU/RAM) with actual session load.
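To show how over- and under-provisioning could be derived for a single 30-minute block, here is a small sketch. It assumes capacity is simply available hosts multiplied by the session density value; the function and key names are hypothetical, and the real Insights calculation may differ.

```python
from typing import Dict


def classify_interval(available_hosts: int, session_density: int,
                      demanded_sessions: int) -> Dict[str, int]:
    """Summarize one 30-minute block of capacity versus demand."""
    capacity = available_hosts * session_density
    return {
        "capacity": capacity,
        # Sessions provisioned beyond what was needed.
        "over_provisioned_sessions": max(capacity - demanded_sessions, 0),
        # Users who could not get a session in this block.
        "excess_users": max(demanded_sessions - capacity, 0),
    }
```

With 3 hosts at a density of 12 and 30 sessions demanded, the block has 36 sessions of capacity and 6 unused; at 40 sessions demanded, 4 users are left without a session.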
In Premium environments with Azure OpenAI deployed in the customer's own tenant, Nerdio provides AI-powered optimization recommendations after analyzing 7 days of data. These may include:
Adjusting minimum and maximum host pool sizes
Tuning session thresholds and pre-stage targets
Refining working hours or drain window settings
All AI processing runs entirely within the customer’s Azure environment — no data ever leaves the tenant.
Orchestrating via API
Auto-scale doesn't just schedule. It acts. Each VM action (start, stop, delete, recreate) is executed through the Azure ARM API:
Start: POST /SessionHost/Start
Stop: POST /SessionHost/Stop
Delete: DELETE /SessionHost
Recreate: POST /SessionHost/Recreate
Tag updates are pushed using PATCH, including:
LastProcessedByAutoScale
DrainMode=Enabled
PendingDeletion=Yes
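For the tag updates, the sketch below builds a PATCH request against Azure Resource Manager's tags-at-scope endpoint using the "Merge" operation, which updates the listed tags without removing others. Whether Nerdio uses this particular endpoint is an assumption; the helper name and the example resource ID are hypothetical, and the request is only constructed, not sent.

```python
import json
from typing import Dict

ARM_ENDPOINT = "https://management.azure.com"
TAGS_API_VERSION = "2021-04-01"


def build_tag_patch(vm_resource_id: str, tags: Dict[str, str]) -> Dict[str, str]:
    """Build (but do not send) an ARM request that merges tags onto a VM."""
    url = (
        f"{ARM_ENDPOINT}{vm_resource_id}"
        f"/providers/Microsoft.Resources/tags/default"
        f"?api-version={TAGS_API_VERSION}"
    )
    # "Merge" keeps existing tags and adds or overwrites the ones given here.
    body = {"operation": "Merge", "properties": {"tags": tags}}
    return {"method": "PATCH", "url": url, "body": json.dumps(body)}
```

A call such as `build_tag_patch(vm_id, {"DrainMode": "Enabled"})` yields a request that any HTTP client could send with a valid bearer token.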
Provisioning always respects the image template, region, naming convention, and disk policy configured in the host pool. Disk tier is automatically switched (stopped/running) on shutdown and start. This configuration is stored in the GoldenImageReference and VmDeploymentProfile tables in SQL.
Defaults that keep you safe
Evaluation interval: 5 minutes
Max host pools evaluated concurrently: 10
Minimum data for Insights: 2 full days
Auto-shrink delay: 24 hours
Default scale-in mode: Medium (unless overridden in host pool configuration)
Retries, cleanup attempts, and restart intervals are controlled via internal app settings and executed automatically in the next cycle.
Closing thoughts
This isn’t a scaling engine that simply reacts. It evaluates, logs, adjusts, heals, and advises — all while staying in lockstep with your defined policy and operational needs. Nerdio’s Auto-Scale is arguably the most advanced auto-scale solution for Azure Virtual Desktop today. Whether you need to reduce cost, maintain performance, or enforce compliance — it’s all here, automated, explainable, and in your control.
In the next article we will dive deeper into our Recommendation Engine, which is also closely related to auto-scale and its data.