 
Dynamic Bandwidth Throttling Bart House, Development Lead, Microsoft
The Problem Client / Server games require server Server uses high outbound bandwidth Bandwidth to update N clients at 30Hz Bandwidth = N * N * per-client-update * 30 Bandwidth for 15 clients with just a 5-byte (40-bit) update Bandwidth = 15 * 15 * 40 * 30 = 270kbps Average home connection <300kbps We never have enough bandwidth Cannot update ideally for large games
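The arithmetic on the slide can be checked directly. A minimal sketch; the 40 in the formula is the 5-byte per-client update expressed in bits:

```python
# Server outbound bandwidth for a client/server game.
# Each of the N clients receives an update about every one of the N clients,
# `rate_hz` times per second, so the cost scales with N * N.
def server_bandwidth_bps(n_clients, update_bytes, rate_hz):
    update_bits = update_bytes * 8  # 5 bytes -> 40 bits
    return n_clients * n_clients * update_bits * rate_hz

# 15 clients, 5-byte updates, 30 Hz -> 270,000 bits/s = 270 kbps
print(server_bandwidth_bps(15, 5, 30))
```

This quadratic growth in N is why the server side, not the clients, is the bottleneck.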
The Problem (cont.) Home machine is in a hostile environment Many devices compete for bandwidth Voice over IP can quickly saturate bandwidth High bandwidth browsing and downloading WoW patches, P2P downloads, etc… And the problem will only get worse MP3 players downloading in the background wirelessly Many Xboxes are connected wirelessly Home machine can move Team LAN parties
Questions we will answer How do we adjust bandwidth utilization to match bandwidth availability? This is the bulk of the talk As available bandwidth changes, how do we adjust game state replication? What do we do when a machine can no longer host? When matchmaking, how do we ensure someone can host the game?
Adjusting Bandwidth Adjusting is done at three different levels Connection Control Server to client connection Takes care of rapid adjustments due to client specific problems Global Control Server adjusts all client connections simultaneously Adjustments for problems across multiple clients Problems most likely due to local bottleneck History Control Server adjusts overall bandwidth target as conditions change and evidence builds Provides continuity during a game Allows for growth and adjustment over course of multiple games Provides basis for estimating future performance
Connection Control Bandwidth between server and client Try to reach and maintain goal bandwidth Goal bandwidth is set by Global Control All traffic is UDP based Reliable messaging built on top Not part of talk today Bandwidth is always used Packets are handed up to game to fill Unused packet space is padded Ensures bandwidth is available when needed
Congestion Control When congestion is detected Reduce current bandwidth When congestion has cleared, increase current bandwidth over time until goal is reached Maintain bandwidth at goal Two signals for congestion control Increase in measured RTT Packet loss due to timeout or subsequent acknowledgements
Congestion Control States Maintain State Remain in this state if goal is reached Transition to Growth if able to maintain current bandwidth for some period of time Recovery State Entered when congestion is detected Transition to Maintain when able to achieve current bandwidth and when RTT has stabilized
Congestion Control States Growth State Growth rate established when target was set Allows for rapidly growing connections when appropriate Growth is stopped when congestion is detected Recovery state is entered Growth is stopped if measured throughput fails to come close to current bandwidth Maintain state is entered Growth occurs in steps Next step taken when measured throughput comes close enough to current bandwidth RTT threshold used for congestion warning signal Established when state is entered Adjusted as packet sizes increase
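The three per-connection states and their transitions can be sketched as a small state machine. The state names, predicates, and transition conditions below are illustrative assumptions modeled on the slides, not the shipped implementation:

```python
# Per-connection congestion control states described in the talk.
MAINTAIN, RECOVERY, GROWTH = "maintain", "recovery", "growth"

def next_state(state, congested, throughput_ok, rtt_stable, held_long_enough):
    if congested:
        return RECOVERY  # any congestion signal forces recovery
    if state == RECOVERY:
        # leave recovery once current bandwidth is achieved and RTT has settled
        return MAINTAIN if (throughput_ok and rtt_stable) else RECOVERY
    if state == MAINTAIN:
        # after holding current bandwidth for a while, try growing again
        return GROWTH if held_long_enough else MAINTAIN
    # GROWTH: stop growing if measured throughput fails to track the target
    return GROWTH if throughput_ok else MAINTAIN
```

The key property is that congestion dominates every other transition, so recovery always happens promptly.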
RTT Primary congestion control signal To calculate RTT Timestamp in every packet Allowed us to see changes in RTT more quickly Used low pass filter to calculate smooth RTT Baseline RTT established in Maintain state Significant deviation from baseline used as a signal of congestion
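The low-pass filter described above is typically an exponentially weighted moving average, as in TCP's SRTT. A minimal sketch; the smoothing factor and deviation threshold are assumed values:

```python
# Exponentially weighted moving average of RTT samples (a low-pass filter).
def smooth_rtt(srtt, sample, alpha=0.125):
    return (1 - alpha) * srtt + alpha * sample

# Congestion signal: smoothed RTT deviates significantly from the
# baseline established in the Maintain state (threshold is illustrative).
def congestion_signal(srtt, baseline, threshold=1.5):
    return srtt > baseline * threshold

srtt = 100.0
for sample in (100, 110, 400, 420):  # a latency spike arrives
    srtt = smooth_rtt(srtt, sample)
# srtt has risen well above the 100 ms baseline, flagging congestion
```

The filter damps single noisy samples while still letting a sustained RTT rise show through quickly.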
Packet-loss Two types of packet-loss Loss due to subsequent acknowledgement If we get multiple subsequent acknowledgements we take this as an indication of a packet loss Loss due to timeout Packet failed to be acknowledged after some period of time Timeout calculated from filtered RTT Only causes a congestion control signal if multiple events encountered over an interval of time Spurious packet loss does not trigger control signal
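The "multiple events over an interval" rule for timeout losses can be sketched with a sliding window. The window length and event count are assumptions chosen for illustration:

```python
from collections import deque

# Hypothetical sketch: timeout losses only raise a congestion signal when
# several occur within a sliding time window, so a single spurious loss
# never triggers the control signal.
class LossSignal:
    def __init__(self, window_s=2.0, min_events=3):
        self.window_s = window_s
        self.min_events = min_events
        self.events = deque()

    def record_loss(self, now):
        self.events.append(now)
        # drop loss events that have aged out of the window
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        return len(self.events) >= self.min_events  # True => congestion
```

Losses detected via subsequent acknowledgements could feed the same accumulator.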
Global Control The server keeps a current goal bandwidth which is divided among all client connections Attempt to reach goal by growing bandwidth and backing off when bandwidth is exceeded Ability to look at behavior across multiple connections allows detection of bad connections
Global Control States Recovery State Entered when bandwidth is exceeded Slow Adjustment is entered when fully recovered Rapid Growth State Quick adjustments are made until bandwidth is exceeded or goal is encountered Slow Adjustment State Small incremental growth until bandwidth is exceeded or goal is reached Goal Reached State Goal bandwidth is maintained until bandwidth is exceeded
Global Recovery State Continue to reduce current global bandwidth as long as bandwidth is being exceeded Reduction occurs at regular interval Percentage reduction applied until some minimum Current bandwidth is reduced immediately when state is entered Amount of reduction is less when entered from rapid growth
Global Slow/Rapid Growth Grow bandwidth in steps Each step is a small/large percentage of current global bandwidth Next step taken when global bandwidth is measured and maintained over a period of time Growth continues until bandwidth is exceeded or goal is reached
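The stepped growth above can be sketched in a few lines. The step fractions are assumed values; slow and rapid growth differ only in step size:

```python
# One growth step toward the global goal: grow by a fraction of the
# current bandwidth, never overshooting the goal.
def grow(current, goal, step_frac):
    return min(goal, current * (1 + step_frac))

bw = 100_000
for _ in range(5):
    bw = grow(bw, 300_000, 0.25)  # rapid growth: 25% steps (assumed)
# bw has been capped at the 300 kbps goal
```

Each step is only taken once the previous level has been measured and held, so growth stalls naturally if the link cannot deliver.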
Detecting Bandwidth Overuse If over 50% of the connections are in their recovery control state we assume that the bandwidth is exceeded Single bad connection can affect global control Will not cause bandwidth exceeded signal But can prohibit growth of bandwidth due to failure to deliver its share of throughput
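The majority test above is a one-liner. A sketch, assuming each connection reports its current control state:

```python
# Global bandwidth is considered exceeded only when more than half of the
# client connections are in their per-connection recovery state.
def bandwidth_exceeded(connection_states):
    in_recovery = sum(1 for s in connection_states if s == "recovery")
    return in_recovery * 2 > len(connection_states)
```

A strict majority is required, so one bad connection in recovery never trips the global signal on its own.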
Dividing Bandwidth Even bandwidth distribution among client connections Extra bandwidth given based on need Some clients act as voice repeaters and thus require additional bandwidth Bandwidth might be limited for bad connections
Bad Connections Any connection that accumulates congestion signals over time is eventually marked as bad We keep a counter that increments for every period that experienced congestion and decrements for every period that did not Bandwidth to that connection is limited Its throughput is no longer taken into consideration when determining whether goal bandwidth is met All traffic sent is added to total throughput since we can’t rely on acknowledgements from connection
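The leaky counter described above might look like this. The threshold is an assumed value:

```python
# Hypothetical detector for bad connections: the counter increments for each
# measurement period that saw congestion, decrements (never below zero) for
# each clean period, and flags the connection as bad past a threshold.
class BadConnectionDetector:
    def __init__(self, threshold=10):
        self.count = 0
        self.threshold = threshold

    def end_period(self, congested):
        self.count = self.count + 1 if congested else max(0, self.count - 1)
        return self.count >= self.threshold  # True => mark connection bad
```

Because clean periods drain the counter, only persistently congested connections ever cross the threshold.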
History Control Global bandwidth is adjusted over time Global goal bandwidth Adjusted up linearly If the host is able to consistently maintain it Adjusted down exponentially If the host fails to maintain it Bandwidth History Period Global goal held constant Periods used as quantum of measurement
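The linear-up, exponential-down adjustment can be sketched directly. The step size and backoff factor are illustrative assumptions:

```python
# History control adjustment of the global goal bandwidth:
# additive (linear) growth while the host maintains the goal,
# multiplicative (exponential) cut when it fails.
def adjust_goal(goal_bps, succeeded, step_bps=20_000, backoff=0.5):
    return goal_bps + step_bps if succeeded else goal_bps * backoff
```

This is the familiar AIMD shape from TCP congestion control: cautious growth, decisive retreat.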
Bandwidth History Periods Starts when a global goal bandwidth is set Period ends when: Goal changes and the period is canceled For instance due to a client leaving Goal reached and held for a period of time No global control recoveries occurred Period is considered successful Successful result recorded with goal bandwidth Some global control recoveries occurred Period is neither successful nor failed Nothing is recorded Failure occurs Goal not reached given sufficient time Multiple global control recoveries occurred
Bandwidth History Successful/failed periods are recorded Recorded in circular buffer Large enough to hold many hours of play Stored using per-user live storage History is tied both to box and the player From the history, we can calculate: Reliability percentage for some bandwidth Failure percentage given some bandwidth
Reliability Percentage Given some bandwidth X, how reliable has this machine been at delivering that bandwidth ReliabilityPercentage(X) = Success(X) / (Success(X) + Failure(X)) * 100 Where Success(X) is the number of successful periods at or above bandwidth X and Failure(X) is the number of failed periods at or below bandwidth X
Failure Percentage Given some bandwidth X, how often have we failed to deliver that bandwidth FailurePercentage(X) = Failure(X) / (TotalSuccess + Failure(X)) * 100 Where TotalSuccess is the total number of successful periods regardless of bandwidth
Reliable Bandwidth Greatest bandwidth X such that ReliabilityPercentage(X) >= 95% Stated another way What bandwidth should we pick to ensure that we will only get a 1 in 20 chance of having a failure to maintain that bandwidth We want to ensure overall consistency in game play
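The three history metrics above can be computed from the recorded periods. A sketch, assuming each history entry is a (goal bandwidth, succeeded) pair and that candidate bandwidths are the discrete step values:

```python
# ReliabilityPercentage(X): successes at or above X vs failures at or below X.
def reliability_pct(history, x):
    succ = sum(1 for bw, ok in history if ok and bw >= x)
    fail = sum(1 for bw, ok in history if not ok and bw <= x)
    return 100.0 * succ / (succ + fail) if succ + fail else 0.0

# FailurePercentage(X): failures at or below X vs all successes.
def failure_pct(history, x):
    total_succ = sum(1 for _, ok in history if ok)
    fail = sum(1 for bw, ok in history if not ok and bw <= x)
    return 100.0 * fail / (total_succ + fail) if total_succ + fail else 0.0

# Reliable bandwidth: greatest X whose reliability is still >= 95%
# (i.e. at most a 1-in-20 chance of failing to maintain it).
def reliable_bandwidth(history, candidates):
    ok = [x for x in candidates if reliability_pct(history, x) >= 95.0]
    return max(ok) if ok else min(candidates)
```

With nineteen successes at 300kbps and one failure at 320kbps, the reliable bandwidth is 300kbps while FailurePercentage(320) sits exactly at the 5% boundary.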
Use of Reliable Bandwidth Almost always used as global goal Two exceptions When no history is present Conservative estimate is used Enough to be considered when picking host But not too much Below average home connection speed Reduce chance of poor game play at game start
Trying Higher Bandwidth Only consider one step higher A bandwidth that is one step higher than our reliable bandwidth Will use this higher bandwidth if: FailurePercentage(X) <= 5% Success or failure will be recorded Logic runs again
Trying Higher Bandwidths Consider this case Recent failure at 320kbps (stepped up) Connection speed at 300kbps ReliableBandwidth at 300kbps FailurePercentage(320) > 5% What happens 300kbps will be used repeatedly Each success slowly reduces FailurePercentage(320) Eventually, FailurePercentage(320) <= 5% 320kbps will be tried Failure will be recorded This will repeat, trying 320kbps about once every 2 hours
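The step-up rule driving this cycle is small. A sketch; the step size and the 5% cutoff follow the slides, the function name is illustrative:

```python
# Try one step above the reliable bandwidth only while the recorded
# failure percentage at that step is at most 5%; otherwise hold at the
# reliable bandwidth and let successes slowly dilute the failure rate.
def next_goal(reliable_bps, step_bps, failure_pct_at_step):
    if failure_pct_at_step <= 5.0:
        return reliable_bps + step_bps
    return reliable_bps
```

Each success at the reliable bandwidth lowers the failure percentage at the higher step, which is what produces the periodic retry behavior described above.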
Bandwidth Control In Action Let’s consider some typical scenarios No history but good connection No history but bad bandwidth Good history but temporary problem
No History Good Connection A client with no history will default to assuming a reasonable reliable bandwidth Assumption is enough to host a moderate size game If host has good pings to other clients and open NAT, it is likely they will be picked to serve Global Goal Bandwidth will use default Default is conservative Below average home connection Goal will be achievable > 50% of the time Bandwidth period will be recorded as a success Next bandwidth step will immediately be tried FailurePercentage(X) is always 0% until a failure
No History Good Connection Cont. Higher bandwidth will likely succeed Another success recorded Higher and higher bandwidths will be tried Eventually bandwidth will approach actual Latency spike on all connections will occur RTT congestion signal will trigger Connection control recovery state entered across all connections
No History Good Connection Cont. Majority congestion problems Global recovery state will be entered After recovery Slow growth will then be followed by another recovery Multiple recoveries will cause period failure Reported failure will stop increases Another step up will not occur for a while Bandwidth usage stabilizes just below actual bandwidth
No History Bad Bandwidth Initial guess is too high Congestion will be seen immediately First period will be reported as a failure Step down bandwidth is tried next If this fails, step downs are increased exponentially Eventually bandwidth will be set below actual bandwidth Bandwidth will stabilize here
Good History Hiccups Occur Bandwidth is stable below actual bandwidth Something happens Available bandwidth is lowered If it occurs very briefly Connections will briefly experience congestion Bandwidth across all connections will be dropped Single global recovery will occur Period will not fail If it occurs over a sustained period of time Failure will be recorded Reduced bandwidth used next Reductions will continue ReliableBandwidth is significantly reduced This is what we want
Picking Best Host Reliable bandwidth is primary factor Reflects ability of that box under that player’s control to reliably deliver bandwidth (at least 95% of the time) Basing the estimate on a long history captures the ability of the player and box to survive in the hostile home environment
Picking Best Host (cont.) We build a pool of hosts from those that have a reliable bandwidth large enough for current game size From this host pool we use other criteria as tie breakers Ping times to other players NAT type (open preferred over moderate) Percentage of games left gracefully Graceful exits are those where the game has a chance to remove the player from the game before the box is turned off or the network cable is unplugged
Game State Replication Game state Player positions Weapon damage Object positions in world Weapon in hand Player health Object damage As bandwidth between the server and a client changes, the game must react appropriately to the changing conditions
Priority Based Replication Updates are assigned a priority based on: Importance of update Player position more important than player health Time of last update The longer since last update, the higher the priority Client importance Can the players on that client see/hear what is being updated
Priority Based Replication Updates are then sent based on priority Time intensive task to get priority scheme right Must play, take traces and decide whether the right decisions are being made by priority system when under load This hand tuning is critical to get the best polish for the game
Host Migration Host migration is moving the hosting responsibilities from one box to another box Shadowrun only supports host initiated migration Eliminates the potential for exploitation Halo2’s host election process had unintended consequences Encouraged griefing
Host Migration Host is migrated when: Host chooses to leave Game will end, hosting is migrated and host is then removed from game Current host is no longer best Host changed between games Consistent gameplay for duration of game Between rounds would perhaps have been better Game is prematurely stopped Host bandwidth no longer supports game size
Matchmaking Good hosts Good bandwidth Open NAT High hosting reliability Good hosts will favor games without one Will try to join a game that needs a good host before trying to join one that does not But will only do so if the game is a good match
Putting It All Together Home is a hard place to serve from History attempts to identify players who can manage it well Design focuses on consistency Global bandwidth increases are made over multiple rounds of play Good hosts are rewarded with host advantage, thus encouraging players to be good hosts
Putting It All Together (cont.) Quick to respond to changing conditions Low level point-to-point control ensures continued connectivity This response happens very quickly If problems persist, global bandwidth control will kick in and reduce overall targets within minutes Host will eventually be replaced Bad host will have low reliable bandwidth
Putting It All Together During Beta bandwidth histories accurately reflected player connection abilities System repeatedly found the same good hosts as system histories were reset Bad hosts did cause bad rounds of play but were quickly eliminated from pool of hosts in future games
Wish List Ability for game to know when box has moved Create a signature that can be stored with the history that represents the network location For instance using the MAC of the local gateway along with perhaps routing information to a known service Ability to manage QoS in the home Demands for bandwidth in the home are only going to get worse Efforts need to be made to help manage bandwidth across devices in the home
Wish List (cont.) Consistent Bandwidth Control and Prediction across titles and platforms Game developers should be relieved of this job Common problem for many games History is applicable across titles and platforms Power Off UDP Packet Delivery Add ability for hardware to send out notification of powering down before actually powering down This will allow others who are connected to game to be notified of removal of box and thus can handle it gracefully

More Related Content

PPTX
Working of TCP
PPT
Tcp Reliability Flow Control
PDF
Network performance overview
PPTX
TCP/IP
PDF
RIPE 76: TCP and BBR
PPTX
Tcp(no ip) review part1
PPTX
Tcp(no ip) review part2
PDF
RIPE 76: Measuring ATR
Working of TCP
Tcp Reliability Flow Control
Network performance overview
TCP/IP
RIPE 76: TCP and BBR
Tcp(no ip) review part1
Tcp(no ip) review part2
RIPE 76: Measuring ATR

What's hot (20)

PDF
Flow control 11
PPTX
BADCamp 2017 - Anatomy of DDoS
PPT
Tcp Immediate Data Transfer
PPTX
DrupalCon Vienna 2017 - Anatomy of DDoS
PPTX
Tieu luan qo s
PDF
DNS-OARC 34: Measuring DNS Flag Day 2020
PPTX
Congestion control
PPTX
Operation of Ping - (Computer Networking)
PDF
Comparative Analysis of Different TCP Variants in Mobile Ad-Hoc Network
PPTX
WTFast vs VPN
PPTX
Congestion Control
PPTX
Connection Establishment & Flow and Congestion Control
PPT
Tcp Congestion Avoidance
PDF
NZNOG 2020: Buffers, Buffer Bloat and BBR
PDF
Optimal connection
PPTX
PPTX
Application Layer Throughput Control For Video Streaming over HTTP2
PPT
Congetion Control.pptx
PPTX
NTP Server - How it works?
Flow control 11
BADCamp 2017 - Anatomy of DDoS
Tcp Immediate Data Transfer
DrupalCon Vienna 2017 - Anatomy of DDoS
Tieu luan qo s
DNS-OARC 34: Measuring DNS Flag Day 2020
Congestion control
Operation of Ping - (Computer Networking)
Comparative Analysis of Different TCP Variants in Mobile Ad-Hoc Network
WTFast vs VPN
Congestion Control
Connection Establishment & Flow and Congestion Control
Tcp Congestion Avoidance
NZNOG 2020: Buffers, Buffer Bloat and BBR
Optimal connection
Application Layer Throughput Control For Video Streaming over HTTP2
Congetion Control.pptx
NTP Server - How it works?
Ad

Viewers also liked (7)

PPT
Managing Clients' Mission Critical Applications
PPT
DAL17thOctFinal_007.ppt
PPT
PowerPoint presentation
PDF
Product Number: 0
PPT
Webmaster's Report - IEEE Microwave Theory and Techniques Society
PPT
Webmaster's Report - IEEE Microwave Theory and Techniques Society
PPT
slides (PPT)
Managing Clients' Mission Critical Applications
DAL17thOctFinal_007.ppt
PowerPoint presentation
Product Number: 0
Webmaster's Report - IEEE Microwave Theory and Techniques Society
Webmaster's Report - IEEE Microwave Theory and Techniques Society
slides (PPT)
Ad

Similar to House - Dynamic Bandwidth Throttling in a Client Server ... (20)

PPT
Network Application Performance
PPT
Nss Labs Dpi Intro V3
PPTX
chpater 4 FOR Information techonogy students
PPT
PV Powerpoint
PPTX
SWIFT: Tango's Infrastructure For Real-Time Video Call Service
PPT
Computer Networking
PDF
What happens when adaptive video streaming players compete in time-varying ba...
PPTX
Impact of Satellite Networks on Transport Layer Protocols
PDF
Transaction TCP
PDF
Slow_Throughput_Best_Practice_Guides_v1.pdf
PPT
powerpoint
PPTX
WWW for Mobile Apps
PDF
How to Talk to My ISP in Technical Terms About a Slow Internet.pdf
PPTX
module-3-chapter-4-replication-san2.pptx
PPT
Providing Controlled Quality Assurance in Video Streaming ...
PDF
Server-based and Network-assisted Solutions for Adaptive Video Streaming
PPTX
Online TCP-IP Networking Assignment Help
PPT
Congestionin Data Networks
PPTX
transport layer pptxdkididkdkdkddjjdjffkfif
PPT
MM_Conferencing.ppt
Network Application Performance
Nss Labs Dpi Intro V3
chpater 4 FOR Information techonogy students
PV Powerpoint
SWIFT: Tango's Infrastructure For Real-Time Video Call Service
Computer Networking
What happens when adaptive video streaming players compete in time-varying ba...
Impact of Satellite Networks on Transport Layer Protocols
Transaction TCP
Slow_Throughput_Best_Practice_Guides_v1.pdf
powerpoint
WWW for Mobile Apps
How to Talk to My ISP in Technical Terms About a Slow Internet.pdf
module-3-chapter-4-replication-san2.pptx
Providing Controlled Quality Assurance in Video Streaming ...
Server-based and Network-assisted Solutions for Adaptive Video Streaming
Online TCP-IP Networking Assignment Help
Congestionin Data Networks
transport layer pptxdkididkdkdkddjjdjffkfif
MM_Conferencing.ppt

More from webhostingguy (20)

PPT
File Upload
PDF
Running and Developing Tests with the Apache::Test Framework
PDF
MySQL and memcached Guide
PPT
Novell® iChain® 2.3
PDF
Load-balancing web servers Load-balancing web servers
PDF
SQL Server 2008 Consolidation
PDF
What is mod_perl?
PDF
What is mod_perl?
PDF
Master Service Agreement
PPT
PPT
PHP and MySQL PHP Written as a set of CGI binaries in C in ...
PDF
Dell Reference Architecture Guide Deploying Microsoft® SQL ...
PPT
Managing Diverse IT Infrastructure
PPT
Web design for business.ppt
PPS
IT Power Management Strategy
PPS
Excel and SQL Quick Tricks for Merchandisers
PPT
OLUG_xen.ppt
PPT
Parallels Hosting Products
PPT
Microsoft PowerPoint presentation 2.175 Mb
PDF
Reseller's Guide
File Upload
Running and Developing Tests with the Apache::Test Framework
MySQL and memcached Guide
Novell® iChain® 2.3
Load-balancing web servers Load-balancing web servers
SQL Server 2008 Consolidation
What is mod_perl?
What is mod_perl?
Master Service Agreement
PHP and MySQL PHP Written as a set of CGI binaries in C in ...
Dell Reference Architecture Guide Deploying Microsoft® SQL ...
Managing Diverse IT Infrastructure
Web design for business.ppt
IT Power Management Strategy
Excel and SQL Quick Tricks for Merchandisers
OLUG_xen.ppt
Parallels Hosting Products
Microsoft PowerPoint presentation 2.175 Mb
Reseller's Guide

House - Dynamic Bandwidth Throttling in a Client Server ...

  • 1.  
  • 2. Dynamic Bandwidth Throttling Bart House, Development Lead, Microsoft
  • 3. The Problem Client / Server games require server Server uses high outbound bandwidth Bandwidth to update N clients at 30Hz Bandwidth = N * N * per-client-update * 30 Bandwidth for 15 clients with just 5 byte update Bandwidth = 15 * 15 * 40 * 30 = 270kbps Average home connection <300kbps We never have enough bandwidth Can not update ideally for large games
  • 4. The Problem (cont.) Home machine is in a hostile environment Many devices compete for bandwidth Voice over IP can quickly saturate bandwidth High bandwidth browsing and downloading WoW patches, P2P downloads, etc… And the problem will only get worse MP3 players downloading in the background wirelessly Many Xboxes are connected wirelessly Home machine can move Team LAN parties
  • 5. Questions we will answer How do we adjust bandwidth utilization to match bandwidth availability? This is the bulk of the talk As available bandwidth changes, how do we adjust game state replication? What do we do when a machine can no longer host? When matchmaking, how do we ensure someone can host the game?
  • 6. Adjusting Bandwidth Adjusting is done at three different levels Connection Control Server to client connection Takes care of rapid adjustments due to client specific problems Global Control Server adjusts all client connections simultaneously Adjustments for problems across multiple clients Problems most likely due to local bottleneck History Control Server adjust overall bandwidth target as conditions change and evidence builds Provides continuity during a game Allows for growth and adjustment over course of multiple games Provides basis for estimating future performance
  • 7. Connection Control Bandwidth between two sever and client Try to reach and maintain goal bandwidth Goal bandwidth is set by Global Control All traffic is UDP based Reliable messaging built on top Not part of talk today Bandwidth is always used Packets are handed up to game to fill Unused packet space is padded Ensures bandwidth is available when needed
  • 8. Congestion Control When congestion is detected Reduce current bandwidth When congestion has cleared, increase current bandwidth over time until goal is reached Maintain bandwidth at goal Two signals for congestion control Increase in measured RTT Packet loss due to timeout or subsequent acknowledgements
  • 9. Congestion Control States Maintain State Remain in this state if goal is reached Transition to Growth if able to maintain current bandwidth for some period of time Recovery State Entered when congestion is detected Transition to Maintain when able to achieve current bandwidth and when RTT has stabilized
  • 10. Congestion Control States Growth State Growth rate established when target was set Allows for rapidly growing connections when appropriate Growth is stopped when congestion is detected Recovery state is entered Growth is stopped if measured throughput fails to come close to current bandwidth Maintain state is entered Growth occurs in steps Next step taken when measured throughput comes close enough to current bandwidth RTT threshold used for congestion warning signal Established when state is entered Adjusted as packet sizes increase
  • 11. RTT Primary congestion control signal To calculate RTT Timestamp in every packet Allowed us to see changes in RTT more quickly Used low pass filter to calculate smooth RTT Baseline RTT established in Maintain state Significant deviation from baseline used a signal of congestion
  • 12. Packet-loss Two types of packet-loss Loss due to subsequent acknowledgement If we get multiple subsequent acknowledgements we take this as an indication of a packet loss Loss due to timeout Packet failed to be acknowledged after some period of time Timeout calculated from filtered RTT Only causes a congestion control signal if multiple events encountered over an interval of time Spurious packet loss does not trigger control signal
  • 13. Global Control The server keeps a current goal bandwidth which is divided among all client connections Attempt to reach goal by growing bandwidth backing off when bandwidth is exceeded Ability to look at behavior across multiple connections allows detection of bad connections
  • 14. Global Control States Recovery State Entered when bandwidth is exceeded Slow Adjustment is entered when fully recovered Rapid Growth State Quick adjustments are made until bandwidth is exceeded or goal is encountered Slow Adjustment State Small incremental growth until bandwidth is exceeded or goal is reached Goal Reached State Goal bandwidth is maintained until bandwidth is exceeded
  • 15. Global Recovery State Continue to reduce current global bandwidth as long as bandwidth is being exceeded Reduction occurs at regular interval Percentage reduction applied until some minimum Current bandwidth is reduced immediately when state is entered Amount of reduction is less when entered from rapid growth
  • 16. Global Slow/Rapid Growth Grow bandwidth in steps Each step is a small/large percentage of current global bandwidth Next step taken when global bandwidth is measured and maintained over a period of time Growth continues until bandwidth is exceeded or goal is reached
  • 17. Detecting Bandwidth Overuse If over 50% of the connections are in there recovery control state we assume that the bandwidth is exceeded Single bad connection can affect global control Will not cause bandwidth exceeded signal But can prohibit growth of bandwidth due to failure to deliver its share of throughput
  • 18. Dividing Bandwidth Even bandwidth distribution among client connections Extra bandwidth given based on need Some clients act as voice repeaters and thus require additional bandwidth Bandwidth might be limited for bad connections
  • 19. Bad Connections Any connection that accumulates congestion signals over time is eventually marked as bad We keep a counter that accumulates for every period that experienced a congestion period and decrements for every period that did not Bandwidth to that connection is limited It’s throughput is no longer taken into consideration when determining whether goal bandwidth is met All traffic sent is added to total throughput since we can’t rely on acknowledgements from connection
  • 20. History Control Global bandwidth is adjusted over time Global goal bandwidth Adjusted up linearly If the host is able to consistently maintain it Adjusted down exponentially If the host fails to maintain it Bandwidth History Period Global goal held constant Periods used as quantum of measurement
  • 21. Bandwidth History Periods Starts when a global goal bandwidth is set Period ends when: Goal changes and the period is canceled For instance due to a client leaving Goal reached and held for a period of time No global control recoveries occurred Period is considered successful Successful result recorded with goal bandwidth Some global control recoveries occurred Period is neither successful or failed Nothing is recorded Failure occurs Goal not reached given sufficient time Multiple global control recoveries occurred
  • 22. Bandwidth History Successful/failed periods are recorded Recorded in circular buffer Large enough to hold many hours of play Stored using per-user live storage History is tied both to box and the player From the history, we can calculate: Reliability percentage for some bandwidth Failure percentage given some bandwidth
  • 23. Reliability Percentage Give some bandwidth X, how reliable has this machine been at delivering that bandwidth Success(X) / (Success(X) + Failure(X))*100 Success(X) is the number of successful periods at or above bandwidth X Where Failure(X) is the number of failed periods at or below bandwidth X
  • 24. Failure Percentage Given some bandwidth X, how often have we failed to deliver that bandwidth Failure(X) / (TotalSuccess + Failure(X))*100 Where TotalSuccess is the total number of successful periods regardless of bandwidth
  • 25. Reliable Bandwidth Greatest bandwidth X such that ReliabilityPercentage(X) >= 95% Stated another way What bandwidth should we pick to ensure that we will only get a 1 in 20 chance of having a failure to maintain that bandwidth We want to ensure overall consistency in game play
  • 26. Use of Reliable Bandwidth Almost always used as global goal Two exceptions When no history is present Conservative estimate is used Enough to be considered when picking host But not too much Below average home connection speed Reduce chance of poor game play at game start
  • 27. Trying Higher Bandwidth Only consider one step higher A bandwidth that is one step higher then our reliable bandwidth Will use this higher bandwidth if: FailurePercentage(X) <= 5% Success or failure will be recorded Logic runs again
  • 28. Trying Higher Bandwidths Consider this case Recent failure at 320kbps (stepped up) Connection speed at 300kbps ReliableBandwidth at 300kbps FailurePercentage(320) > 5% What happens 300kbps will be used repeatedly Each success slowly reduces FailurePercentage(320) Eventuall, FailurePercentage(320) <= 5% 320kbps will be tried Failure will be record This will repeat trying 320 once in every 2 hours
  • 29. Bandwidth Control In Action Lets consider some typical scenarios No history but good connection No history but bad bandwidth Good history but temporary problem
  • 30. No History Good Connection A client with no history will default to assuming a reasonable reliable bandwidth Assumption is enough to host a moderate size game If host has good pings to other clients and open NAT, it is likely they will be picked to serve Global Goal Bandwidth will use default Default is conservative Below average home connection Goal will be achievable > 50% of the time Bandwidth period will be recorded as a success Next bandwidth step will immediately be tried FailurePercentage(x) is always 0% until a failure
  • 31. No History Good Connection Cont. Higher bandwidth will likely succeed Another success recorded Higher and higher bandwidths will be tried Eventually bandwidth will approach actual Latency spike on all connections will occur RTT congestion signal will trigger Connection control recovery state entered across all connections
  • 32. No History Good Connection Cont. Majority congestion problems Global recovery state will be entered After recovery Slow growth will then be follwed by another recovery Multiple recoveries will cause period failure Reported failure will stop increases Another step up will not occur for a while Bandwidth usage stabilizes just below actual bandwidth
  • 33. No History Bad Bandwidth Initial guess is too high Congestion will be seen immediately First period will be reported as a failure Step down bandwidth is tried next If this fails, step downs are increased exponentially Eventually bandwidth will be set below actual bandwidth Bandwidth will stabilize here
  • 34. Good History Hiccups Occur Bandwidth is stable below actual bandwidth Something happens Available bandwidth is lowered If it occurs very briefly Connections will briefly experience congestion Bandwidth across all connections will be dropped Single global recovery will occur Period will not fail If it occurs over a sustained period of time Failure will be recorded Reduced bandwidth used next Reductions will continue RelaibleBandwidth is significantly reduced This is what we want
  • 35. Picking Best Host Reliable bandwidth is primary factor Reflects the ability of that box, under that player's control, to reliably deliver bandwidth (at least 95% of the time) Basing the estimate on a long history captures the ability of the player and box to survive in the hostile home environment
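The reliable-bandwidth estimate above can be sketched as a circular buffer of period results, with "reliable" meaning a goal that succeeded at least 95% of the times it was attempted. The 95% threshold is from the talk; the buffer size and the rest of the scheme are assumptions.

```python
from collections import deque


class BandwidthHistory:
    """Circular buffer of (goal_kbps, succeeded) period results."""

    def __init__(self, capacity=256):
        # deque with maxlen drops the oldest entries automatically
        self.periods = deque(maxlen=capacity)

    def record(self, goal_kbps, succeeded):
        self.periods.append((goal_kbps, succeeded))

    def reliable_bandwidth(self, threshold=0.95):
        """Highest recorded goal whose success rate meets the threshold."""
        best = 0
        for goal in {g for g, _ in self.periods}:
            results = [ok for g, ok in self.periods if g == goal]
            if sum(results) / len(results) >= threshold and goal > best:
                best = goal
        return best
```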
  • 36. Picking Best Host (cont.) We build a pool of hosts from those that have a reliable bandwidth large enough for current game size From this host pool we use other criteria as tie breakers Ping times to other players NAT type (open preferred over moderate) Percentage of games left gracefully Graceful exits are those where the game has a chance to remove the player from the game before the box is turned off or the network cable is unplugged
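The pool-plus-tiebreakers selection above can be sketched as a filter followed by a lexicographic sort. Field names and the NAT ranking are assumptions for illustration.

```python
def pick_host(candidates, required_kbps):
    """Sketch of host selection: pool = everyone whose reliable
    bandwidth can carry the game; within the pool, tie-break on
    average ping, then NAT openness, then graceful-exit percentage.
    Candidate dict fields are hypothetical names, not from the talk.
    """
    pool = [c for c in candidates if c["reliable_kbps"] >= required_kbps]
    if not pool:
        return None  # no one can host a game this size
    nat_rank = {"open": 0, "moderate": 1, "strict": 2}
    return min(pool, key=lambda c: (c["avg_ping_ms"],
                                    nat_rank[c["nat_type"]],
                                    -c["graceful_exit_pct"]))
```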
  • 37. Game State Replication Game state Player positions Weapon damage Object positions in world Weapon in hand Player health Object damage As bandwidth between the server and a client changes, the game must react appropriately to the changing conditions
  • 38. Priority Based Replication Updates are assigned a priority based on: Importance of update Player position more important than player health Time of last update The longer since last update, the higher the priority Client importance Can the players on that client see/hear what is being updated
  • 39. Priority Based Replication Updates are then sent based on priority Time intensive task to get priority scheme right Must play, take traces, and decide whether the priority system is making the right decisions under load This hand tuning is critical to get the best polish for the game
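The three factors above (type importance, staleness, client relevance) combine into a score, and the highest-scoring updates are packed into each outgoing packet. A minimal sketch; the scoring formula and weights are assumptions, not the shipped tuning.

```python
def update_priority(importance, seconds_since_last, client_relevance):
    """Hypothetical priority score: type importance, scaled up the
    longer the update has been starved, scaled down to zero if no
    player on the target client can see/hear the object."""
    return importance * (1.0 + seconds_since_last) * client_relevance


def fill_packet(updates, budget_bytes):
    """Greedily pack the highest-priority updates into one packet."""
    sent = []
    for u in sorted(updates, key=lambda u: -u["priority"]):
        if u["size"] <= budget_bytes:
            sent.append(u)
            budget_bytes -= u["size"]
    return sent
```

Under load the budget shrinks, so low-priority updates (health, invisible objects) naturally get deferred while positions keep flowing.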
  • 40. Host Migration Host migration is moving the hosting responsibilities from one box to another box Shadowrun only supports host initiated migration Eliminates the potential for exploitation Halo 2's host election process had unintended consequences Encouraged griefing
  • 41. Host Migration Host is migrated when: Host chooses to leave Game will end, hosting is migrated and host is then removed from game Current host is no longer best Host changed between games Consistent gameplay for duration of game Between rounds would perhaps have been better Game is prematurely stopped Host bandwidth no longer supports game size
  • 42. Matchmaking Good hosts Good bandwidth Open NAT High Hosting Reliability Good hosts will favor games without a good host Will try to join a game that needs a good host before trying to join one that does not But will only do so if game is a good match
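The search bias above can be sketched as a ranking rule: a good host ranks otherwise-acceptable games that lack a good host first. The fields and match-score floor are assumptions for illustration.

```python
def rank_search_results(games, searcher_is_good_host, match_score_floor=0.7):
    """Sketch of the matchmaking bias: only games that are already a
    good match are considered; among those, a good host prefers the
    games that still need one. Field names are hypothetical."""
    eligible = [g for g in games if g["match_score"] >= match_score_floor]
    if not searcher_is_good_host:
        return sorted(eligible, key=lambda g: -g["match_score"])
    # False sorts before True, so host-less games come first
    return sorted(eligible,
                  key=lambda g: (g["has_good_host"], -g["match_score"]))
```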
  • 43. Putting It All Together Home is a hard place to serve from History attempts to identify players who can manage it well Design focuses on consistency Global bandwidth increases are made over multiple rounds of play Good hosts are rewarded with host advantage, encouraging players to be good hosts
  • 44. Putting It All Together (cont.) Quick to respond to changing conditions Low level point-to-point control ensures continued connectivity This response happens very quickly If problems persist, global bandwidth control will kick in and reduce overall targets within minutes Host will eventually be replaced Bad host will have low reliable bandwidth
  • 45. Putting It All Together During Beta bandwidth histories accurately reflected player connection abilities System repeatedly found the same good hosts as system histories were reset Bad hosts did cause bad rounds of play but were quickly eliminated from the pool of hosts in future games
  • 46. Wish List Ability for game to know when box has moved Create a signature that can be stored with the history that represents the network location For instance using the MAC of the local gateway along with perhaps routing information to a known service Ability to manage QoS in the home Demands for bandwidth in the home are only going to get worse Efforts need to be made to help manage bandwidth across devices in the home
  • 47. Wish List (cont.) Consistent Bandwidth Control and Prediction across titles and platforms Game developers should be relieved of this job Common problem for many games History is applicable across titles and platforms Power Off UDP Packet Delivery Add ability for hardware to send out notification of powering down before actually powering down This will allow others who are connected to the game to be notified of the removal of the box and thus can handle it gracefully

Editor's Notes

  • #4: Gamer needs to be a server We need a lot of bandwidth Early during Shadowrun, design wanted a guarantee that they could do 16 players and really wanted to do more Not able to guarantee Design was not done, not sure what would require updating and how often Reality is that we always need to be smart about how we update
  • #5: Home environment is hostile Machines can move
  • #6: How do we adjust bandwidth? How do we adjust game state replication? When/why do we migrate? How we ensure hosts / how do we pick?
  • #7: Connection Control Global Control History Control
  • #8: Goal set by Global Control UDP Padding
  • #13: Packet Loss Subsequent Acks Timeouts Counters to filter out spurious
  • #14: Global Control Divide bandwidth among connections Grow till exceed and then backoff Ability to look at behavior across connections
  • #18: More than 50% in recovery Single bad connection does not cause signal Could prohibit growth, but we account for that
  • #19: Even distribution for the most part Some need extra due to voice Bandwidth might be limited for bad connections
  • #20: Bad connection Counter accumulates every period with congestion Decrements every period without Threshold means it's bad – in the dog house Can recover Bandwidth limited No longer restricts growth
  • #21: History Control Continuity Over Time Grow while able to deliver Back down if fail Measurements occur during periods
  • #22: Period starts when goal is set Period canceled if goal is changed Ends when goal is met for some time Success if no recoveries - recorded Unknown if some recoveries – not recorded Failure if too many recoveries - recorded
  • #23: All success and failures recorded Circular buffer Holds hours of play From history can calculate reliable percentage / failure percentage
  • #26: Reliable Bandwidth == ReliablePercentage >= 95%
  • #27: Reliable bandwidth used as goal No history get conservative value Step up bandwidth will be tried
  • #28: FailurePercentage less than or equal to 5% Consecutive successes after first step up will continue to step up Failure will cause it to fall back to lower bandwidth
  • #29: After failure, many successes needed to try again 1 in 20 ratio picked – 2 hours of play
  • #31: Good connection, Open Nat, Good Ping Default used Success Recorded Step up tried Success Recorded
  • #32: Step up again will continue Till failure Connections will hit RTT thresholds Latency spike across all connections Some lag in game will occur All connections in recovery
  • #33: Global control will notice majority recoveries causing recovery This will happen multiple times This will lead to failure Another step up will not occur for 20 periods
  • #35: Hiccup will be filtered at either connection or global level Sustained problem will be recorded If it continues it will trash the history Player no longer reliable and will essentially be removed from pool
  • #36: Reliable bandwidth primary consideration Best reflects player/box ability to cope Must be able to meet game needs If best can’t be found, considers other possibilities in bands
  • #37: Pool built from criteria Tie breakers applied Ping, Nat Type, Host reliability
  • #39: Importance of type Time of last Client importance
  • #40: Updates sent based on priority Time intensive task Both dev wise and runtime Requires iteration / tweaking / gameplay Critical for polish
  • #41: Host migration Big debate for Shadowrun Halo 2 approach appears to encourage bad behavior Went with host initiated migration only – removing ability / incentive
  • #43: Searches by good hosts put themselves into games without good hosts
  • #44: Home sucks History identifies good hosts Design for consistency instead of rapid global adjustment Consistent good hosts picked / rewarded
  • #45: Quick to respond to changing conditions Bad hosts eliminated relatively quickly Games likely to disband anyway
  • #46: Beta – good results Good hosts found time and time again when reset Bad hosts culled quickly