 
Dynamic Bandwidth Throttling Bart House, Development Lead, Microsoft
The Problem Client / Server games require server Server uses high outbound bandwidth Bandwidth to update N clients at 30Hz Bandwidth = N * N * per-client-update * 30 Bandwidth for 15 clients with just a 5-byte (40-bit) update Bandwidth = 15 * 15 * 40 * 30 = 270kbps Average home connection <300kbps We never have enough bandwidth Cannot update ideally for large games
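The arithmetic on the slide can be checked directly. A minimal sketch; the 40 in the formula is the 5-byte per-client update expressed in bits:

```python
# Server outbound bandwidth for a client/server game.
# Each of the N clients receives an update about every one of the N clients,
# `rate_hz` times per second, so the cost scales with N * N.
def server_bandwidth_bps(n_clients, update_bytes, rate_hz):
    update_bits = update_bytes * 8  # 5 bytes -> 40 bits
    return n_clients * n_clients * update_bits * rate_hz

# 15 clients, 5-byte updates, 30 Hz -> 270,000 bits/s = 270 kbps
print(server_bandwidth_bps(15, 5, 30))
```

This quadratic growth in N is why the server side, not the clients, is the bottleneck.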
The Problem (cont.) Home machine is in a hostile environment Many devices compete for bandwidth Voice over IP can quickly saturate bandwidth High bandwidth browsing and downloading WoW patches, P2P downloads, etc… And the problem will only get worse MP3 players downloading in the background wirelessly Many Xboxes are connected wirelessly Home machine can move Team LAN parties
Questions we will answer How do we adjust bandwidth utilization to match bandwidth availability? This is the bulk of the talk As available bandwidth changes, how do we adjust game state replication? What do we do when a machine can no longer host? When matchmaking, how do we ensure someone can host the game?
Adjusting Bandwidth Adjusting is done at three different levels Connection Control Server to client connection Takes care of rapid adjustments due to client specific problems Global Control Server adjusts all client connections simultaneously Adjustments for problems across multiple clients Problems most likely due to local bottleneck History Control Server adjusts overall bandwidth target as conditions change and evidence builds Provides continuity during a game Allows for growth and adjustment over course of multiple games Provides basis for estimating future performance
Connection Control Bandwidth between server and client Try to reach and maintain goal bandwidth Goal bandwidth is set by Global Control All traffic is UDP based Reliable messaging built on top Not part of talk today Bandwidth is always used Packets are handed up to game to fill Unused packet space is padded Ensures bandwidth is available when needed
Congestion Control When congestion is detected Reduce current bandwidth When congestion has cleared, increase current bandwidth over time until goal is reached Maintain bandwidth at goal Two signals for congestion control Increase in measured RTT Packet loss due to timeout or subsequent acknowledgements
Congestion Control States Maintain State Remain in this state if goal is reached Transition to Growth if able to maintain current bandwidth for some period of time Recovery State Entered when congestion is detected Transition to Maintain when able to achieve current bandwidth and when RTT has stabilized
Congestion Control States Growth State Growth rate established when target was set Allows for rapidly growing connections when appropriate Growth is stopped when congestion is detected Recovery state is entered Growth is stopped if measured throughput fails to come close to current bandwidth Maintain state is entered Growth occurs in steps Next step taken when measured throughput comes close enough to current bandwidth RTT threshold used for congestion warning signal Established when state is entered Adjusted as packet sizes increase
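The three per-connection states and their transitions can be sketched as a small state machine. The state names, predicates, and transition conditions below are illustrative assumptions modeled on the slides, not the shipped implementation:

```python
# Per-connection congestion control states described in the talk.
MAINTAIN, RECOVERY, GROWTH = "maintain", "recovery", "growth"

def next_state(state, congested, throughput_ok, rtt_stable, held_long_enough):
    if congested:
        return RECOVERY  # any congestion signal forces recovery
    if state == RECOVERY:
        # leave recovery once current bandwidth is achieved and RTT has settled
        return MAINTAIN if (throughput_ok and rtt_stable) else RECOVERY
    if state == MAINTAIN:
        # after holding current bandwidth for a while, try growing again
        return GROWTH if held_long_enough else MAINTAIN
    # GROWTH: stop growing if measured throughput fails to track the target
    return GROWTH if throughput_ok else MAINTAIN
```

The key property is that congestion dominates every other transition, so recovery always happens promptly.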
RTT Primary congestion control signal To calculate RTT Timestamp in every packet Allowed us to see changes in RTT more quickly Used low pass filter to calculate smooth RTT Baseline RTT established in Maintain state Significant deviation from baseline used as a signal of congestion
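The low-pass filter described above is typically an exponentially weighted moving average, as in TCP's SRTT. A minimal sketch; the smoothing factor and deviation threshold are assumed values:

```python
# Exponentially weighted moving average of RTT samples (a low-pass filter).
def smooth_rtt(srtt, sample, alpha=0.125):
    return (1 - alpha) * srtt + alpha * sample

# Congestion signal: smoothed RTT deviates significantly from the
# baseline established in the Maintain state (threshold is illustrative).
def congestion_signal(srtt, baseline, threshold=1.5):
    return srtt > baseline * threshold

srtt = 100.0
for sample in (100, 110, 400, 420):  # a latency spike arrives
    srtt = smooth_rtt(srtt, sample)
# srtt has risen well above the 100 ms baseline, flagging congestion
```

The filter damps single noisy samples while still letting a sustained RTT rise show through quickly.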
Packet-loss Two types of packet-loss Loss due to subsequent acknowledgement If we get multiple subsequent acknowledgements we take this as an indication of a packet loss Loss due to timeout Packet failed to be acknowledged after some period of time Timeout calculated from filtered RTT Only causes a congestion control signal if multiple events encountered over an interval of time Spurious packet loss does not trigger control signal
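The "multiple events over an interval" rule for timeout losses can be sketched with a sliding window. The window length and event count are assumptions chosen for illustration:

```python
from collections import deque

# Hypothetical sketch: timeout losses only raise a congestion signal when
# several occur within a sliding time window, so a single spurious loss
# never triggers the control signal.
class LossSignal:
    def __init__(self, window_s=2.0, min_events=3):
        self.window_s = window_s
        self.min_events = min_events
        self.events = deque()

    def record_loss(self, now):
        self.events.append(now)
        # drop loss events that have aged out of the window
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        return len(self.events) >= self.min_events  # True => congestion
```

Losses detected via subsequent acknowledgements could feed the same accumulator.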
Global Control The server keeps a current goal bandwidth which is divided among all client connections Attempt to reach goal by growing bandwidth and backing off when bandwidth is exceeded Ability to look at behavior across multiple connections allows detection of bad connections
Global Control States Recovery State Entered when bandwidth is exceeded Slow Adjustment is entered when fully recovered Rapid Growth State Quick adjustments are made until bandwidth is exceeded or goal is encountered Slow Adjustment State Small incremental growth until bandwidth is exceeded or goal is reached Goal Reached State Goal bandwidth is maintained until bandwidth is exceeded
Global Recovery State Continue to reduce current global bandwidth as long as bandwidth is being exceeded Reduction occurs at regular interval Percentage reduction applied until some minimum Current bandwidth is reduced immediately when state is entered Amount of reduction is less when entered from rapid growth
Global Slow/Rapid Growth Grow bandwidth in steps Each step is a small/large percentage of current global bandwidth Next step taken when global bandwidth is measured and maintained over a period of time Growth continues until bandwidth is exceeded or goal is reached
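The stepped growth above can be sketched in a few lines. The step fractions are assumed values; slow and rapid growth differ only in step size:

```python
# One growth step toward the global goal: grow by a fraction of the
# current bandwidth, never overshooting the goal.
def grow(current, goal, step_frac):
    return min(goal, current * (1 + step_frac))

bw = 100_000
for _ in range(5):
    bw = grow(bw, 300_000, 0.25)  # rapid growth: 25% steps (assumed)
# bw has been capped at the 300 kbps goal
```

Each step is only taken once the previous level has been measured and held, so growth stalls naturally if the link cannot deliver.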
Detecting Bandwidth Overuse If over 50% of the connections are in their recovery control state we assume that the bandwidth is exceeded Single bad connection can affect global control Will not cause bandwidth exceeded signal But can prohibit growth of bandwidth due to failure to deliver its share of throughput
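The majority test above is a one-liner. A sketch, assuming each connection reports its current control state:

```python
# Global bandwidth is considered exceeded only when more than half of the
# client connections are in their per-connection recovery state.
def bandwidth_exceeded(connection_states):
    in_recovery = sum(1 for s in connection_states if s == "recovery")
    return in_recovery * 2 > len(connection_states)
```

A strict majority is required, so one bad connection in recovery never trips the global signal on its own.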
Dividing Bandwidth Even bandwidth distribution among client connections Extra bandwidth given based on need Some clients act as voice repeaters and thus require additional bandwidth Bandwidth might be limited for bad connections
Bad Connections Any connection that accumulates congestion signals over time is eventually marked as bad We keep a counter that increments for every period that experienced congestion and decrements for every period that did not Bandwidth to that connection is limited Its throughput is no longer taken into consideration when determining whether goal bandwidth is met All traffic sent is added to total throughput since we can’t rely on acknowledgements from connection
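The leaky counter described above might look like this. The threshold is an assumed value:

```python
# Hypothetical detector for bad connections: the counter increments for each
# measurement period that saw congestion, decrements (never below zero) for
# each clean period, and flags the connection as bad past a threshold.
class BadConnectionDetector:
    def __init__(self, threshold=10):
        self.count = 0
        self.threshold = threshold

    def end_period(self, congested):
        self.count = self.count + 1 if congested else max(0, self.count - 1)
        return self.count >= self.threshold  # True => mark connection bad
```

Because clean periods drain the counter, only persistently congested connections ever cross the threshold.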
History Control Global bandwidth is adjusted over time Global goal bandwidth Adjusted up linearly If the host is able to consistently maintain it Adjusted down exponentially If the host fails to maintain it Bandwidth History Period Global goal held constant Periods used as quantum of measurement
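The linear-up, exponential-down adjustment can be sketched directly. The step size and backoff factor are illustrative assumptions:

```python
# History control adjustment of the global goal bandwidth:
# additive (linear) growth while the host maintains the goal,
# multiplicative (exponential) cut when it fails.
def adjust_goal(goal_bps, succeeded, step_bps=20_000, backoff=0.5):
    return goal_bps + step_bps if succeeded else goal_bps * backoff
```

This is the familiar AIMD shape from TCP congestion control: cautious growth, decisive retreat.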
Bandwidth History Periods Starts when a global goal bandwidth is set Period ends when: Goal changes and the period is canceled For instance due to a client leaving Goal reached and held for a period of time No global control recoveries occurred Period is considered successful Successful result recorded with goal bandwidth Some global control recoveries occurred Period is neither successful nor failed Nothing is recorded Failure occurs Goal not reached given sufficient time Multiple global control recoveries occurred
Bandwidth History Successful/failed periods are recorded Recorded in circular buffer Large enough to hold many hours of play Stored using per-user live storage History is tied both to box and the player From the history, we can calculate: Reliability percentage for some bandwidth Failure percentage given some bandwidth
Reliability Percentage Given some bandwidth X, how reliable has this machine been at delivering that bandwidth ReliabilityPercentage(X) = Success(X) / (Success(X) + Failure(X)) * 100 Where Success(X) is the number of successful periods at or above bandwidth X and Failure(X) is the number of failed periods at or below bandwidth X
Failure Percentage Given some bandwidth X, how often have we failed to deliver that bandwidth FailurePercentage(X) = Failure(X) / (TotalSuccess + Failure(X)) * 100 Where TotalSuccess is the total number of successful periods regardless of bandwidth
Reliable Bandwidth Greatest bandwidth X such that ReliabilityPercentage(X) >= 95% Stated another way What bandwidth should we pick to ensure that we will only get a 1 in 20 chance of having a failure to maintain that bandwidth We want to ensure overall consistency in game play
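The three history metrics above can be computed from the recorded periods. A sketch, assuming each history entry is a (goal bandwidth, succeeded) pair and that candidate bandwidths are the discrete step values:

```python
# ReliabilityPercentage(X): successes at or above X vs failures at or below X.
def reliability_pct(history, x):
    succ = sum(1 for bw, ok in history if ok and bw >= x)
    fail = sum(1 for bw, ok in history if not ok and bw <= x)
    return 100.0 * succ / (succ + fail) if succ + fail else 0.0

# FailurePercentage(X): failures at or below X vs all successes.
def failure_pct(history, x):
    total_succ = sum(1 for _, ok in history if ok)
    fail = sum(1 for bw, ok in history if not ok and bw <= x)
    return 100.0 * fail / (total_succ + fail) if total_succ + fail else 0.0

# Reliable bandwidth: greatest X whose reliability is still >= 95%
# (i.e. at most a 1-in-20 chance of failing to maintain it).
def reliable_bandwidth(history, candidates):
    ok = [x for x in candidates if reliability_pct(history, x) >= 95.0]
    return max(ok) if ok else min(candidates)
```

With nineteen successes at 300kbps and one failure at 320kbps, the reliable bandwidth is 300kbps while FailurePercentage(320) sits exactly at the 5% boundary.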
Use of Reliable Bandwidth Almost always used as global goal Two exceptions When no history is present Conservative estimate is used Enough to be considered when picking host But not too much Below average home connection speed Reduce chance of poor game play at game start
Trying Higher Bandwidth Only consider one step higher A bandwidth that is one step higher than our reliable bandwidth Will use this higher bandwidth if: FailurePercentage(X) <= 5% Success or failure will be recorded Logic runs again
Trying Higher Bandwidths Consider this case Recent failure at 320kbps (stepped up) Connection speed at 300kbps ReliableBandwidth at 300kbps FailurePercentage(320) > 5% What happens 300kbps will be used repeatedly Each success slowly reduces FailurePercentage(320) Eventually, FailurePercentage(320) <= 5% 320kbps will be tried Failure will be recorded This will repeat, trying 320kbps about once every 2 hours
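The step-up rule driving this cycle is small. A sketch; the step size and the 5% cutoff follow the slides, the function name is illustrative:

```python
# Try one step above the reliable bandwidth only while the recorded
# failure percentage at that step is at most 5%; otherwise hold at the
# reliable bandwidth and let successes slowly dilute the failure rate.
def next_goal(reliable_bps, step_bps, failure_pct_at_step):
    if failure_pct_at_step <= 5.0:
        return reliable_bps + step_bps
    return reliable_bps
```

Each success at the reliable bandwidth lowers the failure percentage at the higher step, which is what produces the periodic retry behavior described above.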
Bandwidth Control In Action Let’s consider some typical scenarios No history but good connection No history but bad bandwidth Good history but temporary problem
No History Good Connection A client with no history will default to assuming a reasonable reliable bandwidth Assumption is enough to host a moderate size game If host has good pings to other clients and open NAT, it is likely they will be picked to serve Global Goal Bandwidth will use default Default is conservative Below average home connection Goal will be achievable > 50% of the time Bandwidth period will be recorded as a success Next bandwidth step will immediately be tried FailurePercentage(X) is always 0% until a failure
No History Good Connection Cont. Higher bandwidth will likely succeed Another success recorded Higher and higher bandwidths will be tried Eventually bandwidth will approach actual Latency spike on all connections will occur RTT congestion signal will trigger Connection control recovery state entered across all connections
No History Good Connection Cont. Majority congestion problems Global recovery state will be entered After recovery Slow growth will then be followed by another recovery Multiple recoveries will cause period failure Reported failure will stop increases Another step up will not occur for a while Bandwidth usage stabilizes just below actual bandwidth
No History Bad Bandwidth Initial guess is too high Congestion will be seen immediately First period will be reported as a failure Step down bandwidth is tried next If this fails, step downs are increased exponentially Eventually bandwidth will be set below actual bandwidth Bandwidth will stabilize here
Good History Hiccups Occur Bandwidth is stable below actual bandwidth Something happens Available bandwidth is lowered If it occurs very briefly Connections will briefly experience congestion Bandwidth across all connections will be dropped Single global recovery will occur Period will not fail If it occurs over a sustained period of time Failure will be recorded Reduced bandwidth used next Reductions will continue ReliableBandwidth is significantly reduced This is what we want
Picking Best Host Reliable bandwidth is primary factor Reflects ability of that box under that player’s control to reliably deliver bandwidth (at least 95% of the time) Basing the estimate on a long history captures the ability of the player and box to survive in the hostile home environment
Picking Best Host (cont.) We build a pool of hosts from those that have a reliable bandwidth large enough for current game size From this host pool we use other criteria as tie breakers Ping times to other players NAT type (open preferred over moderate) Percentage of games left gracefully Graceful exits are those where the game has a chance to remove the player from the game before the box is turned off or the network cable is unplugged
Game State Replication Game state Player positions Weapon damage Object positions in world Weapon in hand Player health Object damage As bandwidth between the server and a client changes, the game must react appropriately to the changing conditions
Priority Based Replication Updates are assigned a priority based on: Importance of update Player position more important than player health Time of last update The longer since last update, the higher the priority Client importance Can the players on that client see/hear what is being updated
Priority Based Replication Updates are then sent based on priority Time intensive task to get priority scheme right Must play, take traces and decide whether the right decisions are being made by priority system when under load This hand tuning is critical to get the best polish for the game
Host Migration Host migration is moving the hosting responsibilities from one box to another box Shadowrun only supports host initiated migration Eliminates the potential for exploitation Halo2’s host election process had unintended consequences Encouraged griefing
Host Migration Host is migrated when: Host chooses to leave Game will end, hosting is migrated and host is then removed from game Current host is no longer best Host changed between games Consistent gameplay for duration of game Between rounds would perhaps have been better Game is prematurely stopped Host bandwidth no longer supports game size
Matchmaking Good hosts Good bandwidth Open NAT High hosting reliability Good hosts will favor games without one Will try to join a game that needs a good host before trying to join one that does not But will only do so if the game is a good match
Putting It All Together Home is a hard place to serve from History attempts to identify players who can manage it well Design focuses on consistency Global bandwidth increases are made over multiple rounds of play Good hosts are rewarded with host advantage, thus encouraging players to be good hosts
Putting It All Together (cont.) Quick to respond to changing conditions Low level point-to-point control ensures continued connectivity This response happens very quickly If problems persist, global bandwidth control will kick in and reduce overall targets within minutes Host will eventually be replaced Bad host will have low reliable bandwidth
Putting It All Together During Beta bandwidth histories accurately reflected player connection abilities System repeatedly found the same good hosts as system histories were reset Bad hosts did cause bad rounds of play but were quickly eliminated from pool of hosts in future games
Wish List Ability for game to know when box has moved Create a signature that can be stored with the history that represents the network location For instance using the MAC of the local gateway along with perhaps routing information to a known service Ability to manage QoS in the home Demands for bandwidth in the home are only going to get worse Efforts need to be made to help manage bandwidth across devices in the home
Wish List (cont.) Consistent Bandwidth Control and Prediction across titles and platforms Game developers should be relieved of this job Common problem for many games History is applicable across titles and platforms Power Off UDP Packet Delivery Add ability for hardware to send out notification of powering down before actually powering down This will allow others who are connected to game to be notified of removal of box and thus can handle it gracefully

More Related Content

PPTX
Working of TCP
PPT
Tcp Reliability Flow Control
PDF
Network performance overview
PPTX
TCP/IP
PDF
RIPE 76: TCP and BBR
PPTX
Tcp(no ip) review part1
PPTX
Tcp(no ip) review part2
PDF
RIPE 76: Measuring ATR
Working of TCP
Tcp Reliability Flow Control
Network performance overview
TCP/IP
RIPE 76: TCP and BBR
Tcp(no ip) review part1
Tcp(no ip) review part2
RIPE 76: Measuring ATR

What's hot (20)

PDF
Flow control 11
PPTX
BADCamp 2017 - Anatomy of DDoS
PPT
Tcp Immediate Data Transfer
PPTX
DrupalCon Vienna 2017 - Anatomy of DDoS
PPTX
Tieu luan qo s
PDF
DNS-OARC 34: Measuring DNS Flag Day 2020
PPTX
Congestion control
PPTX
Operation of Ping - (Computer Networking)
PDF
Comparative Analysis of Different TCP Variants in Mobile Ad-Hoc Network
PPTX
WTFast vs VPN
PPTX
Congestion Control
PPTX
Connection Establishment & Flow and Congestion Control
PPT
Tcp Congestion Avoidance
PDF
NZNOG 2020: Buffers, Buffer Bloat and BBR
PDF
Optimal connection
PPTX
PPTX
Application Layer Throughput Control For Video Streaming over HTTP2
PPT
Congetion Control.pptx
PPTX
NTP Server - How it works?
Flow control 11
BADCamp 2017 - Anatomy of DDoS
Tcp Immediate Data Transfer
DrupalCon Vienna 2017 - Anatomy of DDoS
Tieu luan qo s
DNS-OARC 34: Measuring DNS Flag Day 2020
Congestion control
Operation of Ping - (Computer Networking)
Comparative Analysis of Different TCP Variants in Mobile Ad-Hoc Network
WTFast vs VPN
Congestion Control
Connection Establishment & Flow and Congestion Control
Tcp Congestion Avoidance
NZNOG 2020: Buffers, Buffer Bloat and BBR
Optimal connection
Application Layer Throughput Control For Video Streaming over HTTP2
Congetion Control.pptx
NTP Server - How it works?
Ad

Viewers also liked (7)

PPT
Managing Clients' Mission Critical Applications
PPT
DAL17thOctFinal_007.ppt
PPT
PowerPoint presentation
PDF
Product Number: 0
PPT
Webmaster's Report - IEEE Microwave Theory and Techniques Society
PPT
Webmaster's Report - IEEE Microwave Theory and Techniques Society
PPT
slides (PPT)
Managing Clients' Mission Critical Applications
DAL17thOctFinal_007.ppt
PowerPoint presentation
Product Number: 0
Webmaster's Report - IEEE Microwave Theory and Techniques Society
Webmaster's Report - IEEE Microwave Theory and Techniques Society
slides (PPT)
Ad

Similar to House - Dynamic Bandwidth Throttling in a Client Server ... (20)

PPT
Network Application Performance
PPT
Nss Labs Dpi Intro V3
PPTX
chpater 4 FOR Information techonogy students
PPT
PV Powerpoint
PPTX
SWIFT: Tango's Infrastructure For Real-Time Video Call Service
PPT
Computer Networking
PDF
What happens when adaptive video streaming players compete in time-varying ba...
PPTX
Impact of Satellite Networks on Transport Layer Protocols
PDF
Transaction TCP
PDF
Slow_Throughput_Best_Practice_Guides_v1.pdf
PPT
powerpoint
PPTX
WWW for Mobile Apps
PDF
How to Talk to My ISP in Technical Terms About a Slow Internet.pdf
PPTX
module-3-chapter-4-replication-san2.pptx
PPT
Providing Controlled Quality Assurance in Video Streaming ...
PDF
Server-based and Network-assisted Solutions for Adaptive Video Streaming
PPTX
Online TCP-IP Networking Assignment Help
PPT
Congestionin Data Networks
PPTX
transport layer pptxdkididkdkdkddjjdjffkfif
PPT
MM_Conferencing.ppt
Network Application Performance
Nss Labs Dpi Intro V3
chpater 4 FOR Information techonogy students
PV Powerpoint
SWIFT: Tango's Infrastructure For Real-Time Video Call Service
Computer Networking
What happens when adaptive video streaming players compete in time-varying ba...
Impact of Satellite Networks on Transport Layer Protocols
Transaction TCP
Slow_Throughput_Best_Practice_Guides_v1.pdf
powerpoint
WWW for Mobile Apps
How to Talk to My ISP in Technical Terms About a Slow Internet.pdf
module-3-chapter-4-replication-san2.pptx
Providing Controlled Quality Assurance in Video Streaming ...
Server-based and Network-assisted Solutions for Adaptive Video Streaming
Online TCP-IP Networking Assignment Help
Congestionin Data Networks
transport layer pptxdkididkdkdkddjjdjffkfif
MM_Conferencing.ppt

More from webhostingguy (20)

PPT
File Upload
PDF
Running and Developing Tests with the Apache::Test Framework
PDF
MySQL and memcached Guide
PPT
Novell® iChain® 2.3
PDF
Load-balancing web servers Load-balancing web servers
PDF
SQL Server 2008 Consolidation
PDF
What is mod_perl?
PDF
What is mod_perl?
PDF
Master Service Agreement
PPT
PPT
PHP and MySQL PHP Written as a set of CGI binaries in C in ...
PDF
Dell Reference Architecture Guide Deploying Microsoft® SQL ...
PPT
Managing Diverse IT Infrastructure
PPT
Web design for business.ppt
PPS
IT Power Management Strategy
PPS
Excel and SQL Quick Tricks for Merchandisers
PPT
OLUG_xen.ppt
PPT
Parallels Hosting Products
PPT
Microsoft PowerPoint presentation 2.175 Mb
PDF
Reseller's Guide
File Upload
Running and Developing Tests with the Apache::Test Framework
MySQL and memcached Guide
Novell® iChain® 2.3
Load-balancing web servers Load-balancing web servers
SQL Server 2008 Consolidation
What is mod_perl?
What is mod_perl?
Master Service Agreement
PHP and MySQL PHP Written as a set of CGI binaries in C in ...
Dell Reference Architecture Guide Deploying Microsoft® SQL ...
Managing Diverse IT Infrastructure
Web design for business.ppt
IT Power Management Strategy
Excel and SQL Quick Tricks for Merchandisers
OLUG_xen.ppt
Parallels Hosting Products
Microsoft PowerPoint presentation 2.175 Mb
Reseller's Guide

House - Dynamic Bandwidth Throttling in a Client Server ...

  • 1.  
  • 2. Dynamic Bandwidth Throttling Bart House, Development Lead, Microsoft
  • 3. The Problem Client / Server games require server Server uses high outbound bandwidth Bandwidth to update N clients at 30Hz Bandwidth = N * N * per-client-update * 30 Bandwidth for 15 clients with just 5 byte update Bandwidth = 15 * 15 * 40 * 30 = 270kbps Average home connection <300kbps We never have enough bandwidth Can not update ideally for large games
  • 4. The Problem (cont.) Home machine is in a hostile environment Many devices compete for bandwidth Voice over IP can quickly saturate bandwidth High bandwidth browsing and downloading WoW patches, P2P downloads, etc… And the problem will only get worse MP3 players downloading in the background wirelessly Many Xboxes are connected wirelessly Home machine can move Team LAN parties
  • 5. Questions we will answer How do we adjust bandwidth utilization to match bandwidth availability? This is the bulk of the talk As available bandwidth changes, how do we adjust game state replication? What do we do when a machine can no longer host? When matchmaking, how do we ensure someone can host the game?
  • 6. Adjusting Bandwidth Adjusting is done at three different levels Connection Control Server to client connection Takes care of rapid adjustments due to client specific problems Global Control Server adjusts all client connections simultaneously Adjustments for problems across multiple clients Problems most likely due to local bottleneck History Control Server adjust overall bandwidth target as conditions change and evidence builds Provides continuity during a game Allows for growth and adjustment over course of multiple games Provides basis for estimating future performance
  • 7. Connection Control Bandwidth between two sever and client Try to reach and maintain goal bandwidth Goal bandwidth is set by Global Control All traffic is UDP based Reliable messaging built on top Not part of talk today Bandwidth is always used Packets are handed up to game to fill Unused packet space is padded Ensures bandwidth is available when needed
  • 8. Congestion Control When congestion is detected Reduce current bandwidth When congestion has cleared, increase current bandwidth over time until goal is reached Maintain bandwidth at goal Two signals for congestion control Increase in measured RTT Packet loss due to timeout or subsequent acknowledgements
  • 9. Congestion Control States Maintain State Remain in this state if goal is reached Transition to Growth if able to maintain current bandwidth for some period of time Recovery State Entered when congestion is detected Transition to Maintain when able to achieve current bandwidth and when RTT has stabilized
  • 10. Congestion Control States Growth State Growth rate established when target was set Allows for rapidly growing connections when appropriate Growth is stopped when congestion is detected Recovery state is entered Growth is stopped if measured throughput fails to come close to current bandwidth Maintain state is entered Growth occurs in steps Next step taken when measured throughput comes close enough to current bandwidth RTT threshold used for congestion warning signal Established when state is entered Adjusted as packet sizes increase
  • 11. RTT Primary congestion control signal To calculate RTT Timestamp in every packet Allowed us to see changes in RTT more quickly Used low pass filter to calculate smooth RTT Baseline RTT established in Maintain state Significant deviation from baseline used a signal of congestion
  • 12. Packet-loss Two types of packet-loss Loss due to subsequent acknowledgement If we get multiple subsequent acknowledgements we take this as an indication of a packet loss Loss due to timeout Packet failed to be acknowledged after some period of time Timeout calculated from filtered RTT Only causes a congestion control signal if multiple events encountered over an interval of time Spurious packet loss does not trigger control signal
  • 13. Global Control The server keeps a current goal bandwidth which is divided among all client connections Attempt to reach goal by growing bandwidth backing off when bandwidth is exceeded Ability to look at behavior across multiple connections allows detection of bad connections
  • 14. Global Control States Recovery State Entered when bandwidth is exceeded Slow Adjustment is entered when fully recovered Rapid Growth State Quick adjustments are made until bandwidth is exceeded or goal is encountered Slow Adjustment State Small incremental growth until bandwidth is exceeded or goal is reached Goal Reached State Goal bandwidth is maintained until bandwidth is exceeded
  • 15. Global Recovery State Continue to reduce current global bandwidth as long as bandwidth is being exceeded Reduction occurs at regular interval Percentage reduction applied until some minimum Current bandwidth is reduced immediately when state is entered Amount of reduction is less when entered from rapid growth
  • 16. Global Slow/Rapid Growth Grow bandwidth in steps Each step is a small/large percentage of current global bandwidth Next step taken when global bandwidth is measured and maintained over a period of time Growth continues until bandwidth is exceeded or goal is reached
  • 17. Detecting Bandwidth Overuse If over 50% of the connections are in there recovery control state we assume that the bandwidth is exceeded Single bad connection can affect global control Will not cause bandwidth exceeded signal But can prohibit growth of bandwidth due to failure to deliver its share of throughput
  • 18. Dividing Bandwidth Even bandwidth distribution among client connections Extra bandwidth given based on need Some clients act as voice repeaters and thus require additional bandwidth Bandwidth might be limited for bad connections
  • 19. Bad Connections Any connection that accumulates congestion signals over time is eventually marked as bad We keep a counter that accumulates for every period that experienced a congestion period and decrements for every period that did not Bandwidth to that connection is limited It’s throughput is no longer taken into consideration when determining whether goal bandwidth is met All traffic sent is added to total throughput since we can’t rely on acknowledgements from connection
  • 20. History Control Global bandwidth is adjusted over time Global goal bandwidth Adjusted up linearly If the host is able to consistently maintain it Adjusted down exponentially If the host fails to maintain it Bandwidth History Period Global goal held constant Periods used as quantum of measurement
  • 21. Bandwidth History Periods Starts when a global goal bandwidth is set Period ends when: Goal changes and the period is canceled For instance due to a client leaving Goal reached and held for a period of time No global control recoveries occurred Period is considered successful Successful result recorded with goal bandwidth Some global control recoveries occurred Period is neither successful or failed Nothing is recorded Failure occurs Goal not reached given sufficient time Multiple global control recoveries occurred
  • 22. Bandwidth History Successful/failed periods are recorded Recorded in circular buffer Large enough to hold many hours of play Stored using per-user live storage History is tied both to box and the player From the history, we can calculate: Reliability percentage for some bandwidth Failure percentage given some bandwidth
  • 23. Reliability Percentage Give some bandwidth X, how reliable has this machine been at delivering that bandwidth Success(X) / (Success(X) + Failure(X))*100 Success(X) is the number of successful periods at or above bandwidth X Where Failure(X) is the number of failed periods at or below bandwidth X
  • 24. Failure Percentage Given some bandwidth X, how often have we failed to deliver that bandwidth Failure(X) / (TotalSuccess + Failure(X))*100 Where TotalSuccess is the total number of successful periods regardless of bandwidth
  • 25. Reliable Bandwidth Greatest bandwidth X such that ReliabilityPercentage(X) >= 95% Stated another way What bandwidth should we pick to ensure that we will only get a 1 in 20 chance of having a failure to maintain that bandwidth We want to ensure overall consistency in game play
  • 26. Use of Reliable Bandwidth Almost always used as global goal Two exceptions When no history is present Conservative estimate is used Enough to be considered when picking host But not too much Below average home connection speed Reduce chance of poor game play at game start
  • 27. Trying Higher Bandwidth Only consider one step higher A bandwidth that is one step higher then our reliable bandwidth Will use this higher bandwidth if: FailurePercentage(X) <= 5% Success or failure will be recorded Logic runs again
  • 28. Trying Higher Bandwidths Consider this case Recent failure at 320kbps (stepped up) Connection speed at 300kbps ReliableBandwidth at 300kbps FailurePercentage(320) > 5% What happens 300kbps will be used repeatedly Each success slowly reduces FailurePercentage(320) Eventuall, FailurePercentage(320) <= 5% 320kbps will be tried Failure will be record This will repeat trying 320 once in every 2 hours
  • 29. Bandwidth Control In Action Lets consider some typical scenarios No history but good connection No history but bad bandwidth Good history but temporary problem
  • 30. No History Good Connection A client with no history will default to assuming a reasonable reliable bandwidth Assumption is enough to host a moderate size game If host has good pings to other clients and open NAT, it is likely they will be picked to serve Global Goal Bandwidth will use default Default is conservative Below average home connection Goal will be achievable > 50% of the time Bandwidth period will be recorded as a success Next bandwidth step will immediately be tried FailurePercentage(x) is always 0% until a failure
  • 31. No History Good Connection Cont. Higher bandwidth will likely succeed Another success recorded Higher and higher bandwidths will be tried Eventually bandwidth will approach actual Latency spike on all connections will occur RTT congestion signal will trigger Connection control recovery state entered across all connections
  • 32. No History Good Connection Cont. Majority congestion problems Global recovery state will be entered After recovery Slow growth will then be follwed by another recovery Multiple recoveries will cause period failure Reported failure will stop increases Another step up will not occur for a while Bandwidth usage stabilizes just below actual bandwidth
  • 33. No History Bad Bandwidth Initial guess is too high Congestion will be seen immediately First period will be reported as a failure Step down bandwidth is tried next If this fails, step downs are increased exponentially Eventually bandwidth will be set below actual bandwidth Bandwidth will stabilize here
  • 34. Good History Hiccups Occur Bandwidth is stable below actual bandwidth Something happens Available bandwidth is lowered If it occurs very briefly Connections will briefly experience congestion Bandwidth across all connections will be dropped Single global recovery will occur Period will not fail If it occurs over a sustained period of time Failure will be recorded Reduced bandwidth used next Reductions will continue RelaibleBandwidth is significantly reduced This is what we want
  • 35. Picking Best Host Reliable bandwidth is primary factor Reflects the ability of that box, under that player's control, to reliably deliver bandwidth (at least 95% of the time) Basing the estimate on a long history captures the ability of the player and box to survive in the hostile home environment
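The reliable-bandwidth estimate above can be sketched as a circular buffer of period results, with "reliable" meaning a goal that succeeded at least 95% of the times it was attempted. The 95% threshold is from the talk; the buffer size and the rest of the scheme are assumptions.

```python
from collections import deque


class BandwidthHistory:
    """Circular buffer of (goal_kbps, succeeded) period results."""

    def __init__(self, capacity=256):
        # deque with maxlen drops the oldest entries automatically
        self.periods = deque(maxlen=capacity)

    def record(self, goal_kbps, succeeded):
        self.periods.append((goal_kbps, succeeded))

    def reliable_bandwidth(self, threshold=0.95):
        """Highest recorded goal whose success rate meets the threshold."""
        best = 0
        for goal in {g for g, _ in self.periods}:
            results = [ok for g, ok in self.periods if g == goal]
            if sum(results) / len(results) >= threshold and goal > best:
                best = goal
        return best
```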
  • 36. Picking Best Host (cont.) We build a pool of hosts from those that have a reliable bandwidth large enough for current game size From this host pool we use other criteria as tie breakers Ping times to other players NAT type (open preferred over moderate) Percentage of games left gracefully Graceful exits are those where the game has a chance to remove the player from the game before the box is turned off or the network cable is unplugged
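The pool-plus-tiebreakers selection above can be sketched as a filter followed by a lexicographic sort. Field names and the NAT ranking are assumptions for illustration.

```python
def pick_host(candidates, required_kbps):
    """Sketch of host selection: pool = everyone whose reliable
    bandwidth can carry the game; within the pool, tie-break on
    average ping, then NAT openness, then graceful-exit percentage.
    Candidate dict fields are hypothetical names, not from the talk.
    """
    pool = [c for c in candidates if c["reliable_kbps"] >= required_kbps]
    if not pool:
        return None  # no one can host a game this size
    nat_rank = {"open": 0, "moderate": 1, "strict": 2}
    return min(pool, key=lambda c: (c["avg_ping_ms"],
                                    nat_rank[c["nat_type"]],
                                    -c["graceful_exit_pct"]))
```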
  • 37. Game State Replication Game state Player positions Weapon damage Object positions in world Weapon in hand Player health Object damage As bandwidth between the server and a client changes, the game must react appropriately to the changing conditions
  • 38. Priority Based Replication Updates are assigned a priority based on: Importance of update Player position more important than player health Time of last update The longer since last update, the higher the priority Client importance Can the players on that client see/hear what is being updated
  • 39. Priority Based Replication Updates are then sent based on priority Time intensive task to get priority scheme right Must play, take traces, and decide whether the priority system is making the right decisions under load This hand tuning is critical to get the best polish for the game
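The three factors above (type importance, staleness, client relevance) combine into a score, and the highest-scoring updates are packed into each outgoing packet. A minimal sketch; the scoring formula and weights are assumptions, not the shipped tuning.

```python
def update_priority(importance, seconds_since_last, client_relevance):
    """Hypothetical priority score: type importance, scaled up the
    longer the update has been starved, scaled down to zero if no
    player on the target client can see/hear the object."""
    return importance * (1.0 + seconds_since_last) * client_relevance


def fill_packet(updates, budget_bytes):
    """Greedily pack the highest-priority updates into one packet."""
    sent = []
    for u in sorted(updates, key=lambda u: -u["priority"]):
        if u["size"] <= budget_bytes:
            sent.append(u)
            budget_bytes -= u["size"]
    return sent
```

Under load the budget shrinks, so low-priority updates (health, invisible objects) naturally get deferred while positions keep flowing.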
  • 40. Host Migration Host migration is moving the hosting responsibilities from one box to another box Shadowrun only supports host initiated migration Eliminates the potential for exploitation Halo 2's host election process had unintended consequences Encouraged griefing
  • 41. Host Migration Host is migrated when: Host chooses to leave Game will end, hosting is migrated and host is then removed from game Current host is no longer best Host changed between games Consistent gameplay for duration of game Between rounds would perhaps have been better Game is prematurely stopped Host bandwidth no longer supports game size
  • 42. Matchmaking Good hosts Good bandwidth Open NAT High Hosting Reliability Good hosts will favor games without a good host Will try to join a game that needs a good host before trying to join one that does not But will only do so if game is a good match
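The search bias above can be sketched as a ranking rule: a good host ranks otherwise-acceptable games that lack a good host first. The fields and match-score floor are assumptions for illustration.

```python
def rank_search_results(games, searcher_is_good_host, match_score_floor=0.7):
    """Sketch of the matchmaking bias: only games that are already a
    good match are considered; among those, a good host prefers the
    games that still need one. Field names are hypothetical."""
    eligible = [g for g in games if g["match_score"] >= match_score_floor]
    if not searcher_is_good_host:
        return sorted(eligible, key=lambda g: -g["match_score"])
    # False sorts before True, so host-less games come first
    return sorted(eligible,
                  key=lambda g: (g["has_good_host"], -g["match_score"]))
```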
  • 43. Putting It All Together Home is a hard place to serve from History attempts to identify players who can manage it well Design focuses on consistency Global bandwidth increases are made over multiple rounds of play Good hosts are rewarded with host advantage, encouraging players to be good hosts
  • 44. Putting It All Together (cont.) Quick to respond to changing conditions Low level point-to-point control ensures continued connectivity This response happens very quickly If problems persist, global bandwidth control will kick in and reduce overall targets within minutes Host will eventually be replaced Bad host will have low reliable bandwidth
  • 45. Putting It All Together During Beta bandwidth histories accurately reflected player connection abilities System repeatedly found the same good hosts as system histories were reset Bad hosts did cause bad rounds of play but were quickly eliminated from the pool of hosts in future games
  • 46. Wish List Ability for game to know when box has moved Create a signature that can be stored with the history that represents the network location For instance using the MAC of the local gateway along with perhaps routing information to a known service Ability to manage QoS in the home Demands for bandwidth in the home are only going to get worse Efforts need to be made to help manage bandwidth across devices in the home
  • 47. Wish List (cont.) Consistent Bandwidth Control and Prediction across titles and platforms Game developers should be relieved of this job Common problem for many games History is applicable across titles and platforms Power Off UDP Packet Delivery Add ability for hardware to send out notification of powering down before actually powering down This will allow others who are connected to the game to be notified of the removal of the box and thus can handle it gracefully

Editor's Notes

  • #4: Gamer needs to be a server We need a lot of bandwidth Early during Shadowrun, design wanted a guarantee that they could do 16 players and really wanted to do more Not able to guarantee Design was not done, not sure what would require updating and how often Reality is that we always need to be smart about how we update
  • #5: Home environment is hostile Machines can move
  • #6: How do we adjust bandwidth? How do we adjust game state replication? When/why do we migrate? How we ensure hosts / how do we pick?
  • #7: Connection Control Global Control History Control
  • #8: Goal set by Global Control UDP Padding
  • #13: Packet Loss Subsequent Acks Timeouts Counters to filter out spurious
  • #14: Global Control Divide bandwidth among connections Grow till exceed and then backoff Ability to look at behavior across connections
  • #18: More than 50% in recovery Single bad connection does not cause signal Could prohibit growth, but we account for that
  • #19: Even distribution for the most part Some need extra due to voice Bandwidth might be limited for bad connections
  • #20: Bad connection Counter accumulates every period with congestion Decrements every period without Threshold means it's bad – in the dog house Can recover Bandwidth limited No longer restricts growth
  • #21: History Control Continuity Over Time Grow while able to deliver Back down if fail Measurements occur during periods
  • #22: Period starts when goal is set Period canceled if goal is changed Ends when goal is met for some time Success if no recoveries - recorded Unknown if some recoveries – not recorded Failure if too many recoveries - recorded
  • #23: All success and failures recorded Circular buffer Holds hours of play From history can calculate reliable percentage / failure percentage
  • #26: Reliable Bandwidth == ReliablePercentage >= 95%
  • #27: Reliable bandwidth used as goal No history get conservative value Step up bandwidth will be tried
  • #28: FailurePercentage less than or equal to 5% Consecutive successes after first step up will continue to step up Failure will cause it to fall back to lower bandwidth
  • #29: After failure, many successes needed to try again 1 in 20 ratio picked – 2 hours of play
  • #31: Good connection, Open Nat, Good Ping Default used Success Recorded Step up tried Success Recorded
  • #32: Step up again will continue Till failure Connections will hit RTT thresholds Latency spike across all connections Some lag in game will occur All connections in recovery
  • #33: Global control will notice majority recoveries causing recovery This will happen multiple times This will lead to failure Another step up will not occur for 20 periods
  • #35: Hiccup will be filtered at either connection or global level Sustained problem will be recorded If it continues it will trash the history Player no longer reliable and will essentially be removed from pool
  • #36: Reliable bandwidth primary consideration Best reflects player/box ability to cope Must be able to meet game needs If best can’t be found, considers other possibilities in bands
  • #37: Pool built from criteria Tie breakers applied Ping, Nat Type, Host reliability
  • #39: Importance of type Time of last Client importance
  • #40: Updates sent based on priority Time intensive task Both dev wise and runtime Requires iteration / tweaking / gameplay Critical for polish
  • #41: Host migration Big debate for Shadowrun Halo 2 approach appears to encourage bad behavior Went with host initiated migration only – removing ability / incentive
  • #43: Searches by good hosts put themselves into games without good hosts
  • #44: Home sucks History identifies good hosts Design for consistency instead of rapid global adjustment Consistent good hosts picked / rewarded
  • #45: Quick to respond to changing conditions Bad hosts eliminated relatively quickly Games likely to disband anyway
  • #46: Beta – good results Good hosts found time and time again when reset Bad hosts culled quickly