Please give me back my Network Cables!
On Networking Limits in AWS
Steffen Gebert
Miklos Tirpak
SREcon 2025 Americas
2025-03-26
What’s this?
§ "100G network cable"
100G Ethernet
Throughput Limits
§ 148,809,524 pps!
@64 bytes/frame
§ Per direction!
Hopefully correct, for illustration purposes only. Don’t dimension your systems based on this.
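The figure follows from the on-the-wire cost of a minimum-size frame: besides the 64-byte frame itself, every frame occupies a 7-byte preamble, a 1-byte start-of-frame delimiter and a 12-byte inter-frame gap, i.e. 84 bytes or 672 bits. A short Python check (the function name is illustrative):

```python
# On-the-wire overhead per Ethernet frame, on top of the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 7 + 1 + 12


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second in one direction for a given link speed and frame size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


print(f"{line_rate_pps(100e9, 64):,.0f} pps")  # 148,809,524 pps
```

Larger frames amortize the fixed 20 bytes of overhead, which is why the packet rate drops quickly as frames grow.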
Cloudy
Throughput Limits
EC2 Instance Types
Throughput Limits
General Purpose m7g: $950/mo
Compute-optimized + network-enhanced c7gn: $1450/mo
[Figure: two charts of throughput (Gbps, 0 to 20) versus packet size (100 to 9000 bytes)]
§ AWS limits throughput based on
• Bandwidth (data rate): 15 Gbps regardless of packet size (default MTU is 9001 bytes)
• Packets per second (PPS): 1.25 Mpps, i.e., the rate of 1500-byte packets needed to reach 15 Gbps
Real 15 Gbps – or again “up to”?
Throughput Limits
Our IoT use case
→ 1.5 Gbps
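Combining both limits, the achievable rate at a given packet size is the lower of the bandwidth limit and the PPS limit multiplied by the packet size, which is the shape of the charts above. A sketch using the slide's numbers (15 Gbps, 1.25 Mpps) and IP-level packet sizes, as in the slide's own 1.25 Mpps derivation. The 150-byte average packet size is a hypothetical value (the slides don't give the actual traffic mix), chosen because it reproduces the 1.5 Gbps of the IoT use case:

```python
BANDWIDTH_LIMIT_BPS = 15e9  # bandwidth allowance from the slide
PPS_LIMIT = 1.25e6          # 15 Gbps / (1500 bytes * 8 bits)


def effective_throughput_bps(packet_bytes: int,
                             bw_limit_bps: float = BANDWIDTH_LIMIT_BPS,
                             pps_limit: float = PPS_LIMIT) -> float:
    """Rate allowed by whichever limit is reached first at this packet size."""
    pps_bound_bps = pps_limit * packet_bytes * 8
    return min(bw_limit_bps, pps_bound_bps)


# 150 B is a hypothetical IoT average; 9001 B is the default MTU inside AWS.
for size in (150, 500, 1500, 9001):
    print(f"{size:>5} B packets: {effective_throughput_bps(size) / 1e9:5.2f} Gbps")
```

Below 1500 bytes the PPS limit dominates, so small-packet workloads never see the advertised bandwidth.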
Network Functions
Throughput Limits
Router /
Firewall etc.
NIC NIC
750Mbps
downstream
750Mbps
upstream
With a "15 Gbps" instance, we can forward only 750 Mbps in each direction, at peak!
Disclaimer
§ We like working with AWS; it makes our lives so
much easier!
§ But we want to share our journeys through
sleepless nights etc.
Thanks to…
§ Our colleagues, who went through this with us
§ AWS account team and network specialists
One note…
Limits are critical in multi-tenant environments:
they protect us, too, from noisy neighbors!
emnify IoT SuperNetwork
Deployed in
7 breakout regions
Global IoT SIM card
Integration into customer
networks
Cellular connectivity for the
Internet of Things (IoT)
IoT-specific security features
Headquartered in Germany
Engineering EU remote
🇪🇺
🇩🇪
Why are AWS limits so relevant for us?
IoT device IoT backend
Internet
Mobile Network
Operators (MNO)
customer-owned nobody owns it
emnify-contracted emnify-owned
Unpredictable
traffic
Customer pays for
the traffic
EC2 instance type
economics
Burstable Instance Types (“up to X Gbps”)
Throughput Limits
https://docs.aws.amazon.com/ec2/latest/instancetypes/co.html#co_network
§ Documentation is clear about
baseline / burst bandwidth
§ Credits-based mechanism
§ On a best-effort basis
§ Metrics
• bw_in_allowance_exceeded
• bw_out_allowance_exceeded
• pps_allowance_exceeded
§ They indicate queuing (and potentially packet drops)
§ Amazon CloudWatch
§ ethtool -S ens5 reads them from the NIC driver (ENA)
• CloudWatch Agent install + configuration
• Prometheus node_exporter, with ethtool
collector (disabled by default)
§ Monitoring interval: beware of bursts, which coarse
intervals average away
Monitoring
Throughput Limits
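Reading the counters directly from the driver lets you sample as often as you like. A sketch, assuming Linux with ethtool installed and the usual "name: value" layout of ethtool -S output; the function names are illustrative, the counter names are the ENA ones listed above:

```python
import re
import subprocess

# ENA counters that increase when traffic exceeded an instance-level allowance.
ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
)

# One "name: value" pair per line, as printed by `ethtool -S`.
_STAT_LINE = re.compile(r"^\s*([\w.\-\[\]]+):\s*(\d+)\s*$")


def parse_ethtool_stats(text: str) -> dict[str, int]:
    """Turn `ethtool -S <iface>` output into a {counter_name: value} dict."""
    stats = {}
    for line in text.splitlines():
        match = _STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def read_ethtool_stats(iface: str) -> dict[str, int]:
    """Run `ethtool -S` for one interface and parse its output."""
    result = subprocess.run(["ethtool", "-S", iface],
                            capture_output=True, text=True, check=True)
    return parse_ethtool_stats(result.stdout)


def allowance_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """How much each allowance counter grew between two samples."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in ALLOWANCE_COUNTERS}
```

Calling read_ethtool_stats("ens5") once per second and logging any non-zero allowance_deltas pins down when an allowance was hit far more precisely than a per-minute graph.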
§ Packet drops on a c6i.xlarge with very low traffic
• Bandwidth utilization is 15 Mbps
• bw_out_allowance_exceeded increases every 5 min
§ File upload to S3 every 5 min
• File size: 9 MB
• Data transmit completes within 26 ms
• → 9 MB / 26 ms ≈ 2.8 Gbps peak
Microbursts
Throughput Limits
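The arithmetic behind this microburst, and why a monitoring interval of minutes hides it. It assumes decimal megabytes (9 MB = 9,000,000 bytes) and looks at the upload alone, ignoring the instance's other traffic:

```python
def rate_bps(num_bytes: float, seconds: float) -> float:
    """Average bit rate when num_bytes are transferred within the given time."""
    return num_bytes * 8 / seconds


FILE_BYTES = 9_000_000  # 9 MB upload (decimal megabytes)
TRANSFER_S = 0.026      # transmission completes within 26 ms
INTERVAL_S = 300        # one upload every 5 minutes

peak_bps = rate_bps(FILE_BYTES, TRANSFER_S)      # rate while the upload is running
per_minute_bps = rate_bps(FILE_BYTES, 60)        # what a 1-minute average shows
per_5min_bps = rate_bps(FILE_BYTES, INTERVAL_S)  # what a 5-minute average shows
print(f"peak {peak_bps / 1e9:.2f} Gbps, 1-min avg {per_minute_bps / 1e6:.1f} Mbps, "
      f"5-min avg {per_5min_bps / 1e6:.2f} Mbps")
```

The same bytes look like roughly 2.8 Gbps or like a fraction of a Mbps, depending only on the averaging window.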
§ Limits are enforced at a much finer granularity than
per second.
§ An increasing bw_out_allowance_exceeded does not
always mean dropped packets; they may only have been queued.
§ Make sure you monitor your application's behavior.
§ "Bandwidth is a shared resource"
Instance burst is on a best effort basis, even when the
instance has credits available, as burst bandwidth is a
shared resource.
Microbursts
Throughput Limits
Amazonian: You write that you
observe bandwidth out allowance
exceeded metric increasing on an
instance that is barely used, which
should have enough network credits.
EM: Yes, the metric increases every 5 min.
Amazonian: Your packet size is only
1500 bytes. Can you add a VPC endpoint (VPCe)?
EM: Okay, now it increases even
faster.
Amazonian: Probably a microburst hitting the
PPS limit.
EM: The PPS metric is not increasing.
Amazonian: Scale up the instance!
EM: No, you won’t get more money!
AWS Support
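AWS doesn't document how the allowances are enforced internally. As a generic illustration of why per-second averages mislead, the sketch below uses a token bucket, a common traffic-shaping model (an assumption here, not AWS's implementation), with a hypothetical 1 Gbps rate and 1 Mbit bucket. The 9 MB upload from the microburst example exceeds it when sent within 26 ms, but not when spread over one second:

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Generic token bucket: refills at rate_bps, holds at most burst_bits tokens."""
    rate_bps: float
    burst_bits: float
    tokens: float = field(init=False, default=0.0)
    last_t: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.burst_bits  # start with a full bucket

    def allow(self, t: float, packet_bits: float) -> bool:
        """Refill for the time since the previous packet, then try to spend tokens."""
        elapsed = t - self.last_t
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        self.last_t = t
        if self.tokens >= packet_bits:
            self.tokens -= packet_bits
            return True
        return False  # over the allowance: a real shaper would queue or drop


def count_exceeded(bucket: TokenBucket, total_bytes: int, duration_s: float,
                   packet_bytes: int = 1500) -> int:
    """Send total_bytes as evenly spaced packets within duration_s; count those over the allowance."""
    n_packets = total_bytes // packet_bytes
    gap_s = duration_s / n_packets
    return sum(not bucket.allow(i * gap_s, packet_bytes * 8) for i in range(n_packets))


# Hypothetical allowance: 1 Gbps sustained rate, 1 Mbit bucket.
burst = count_exceeded(TokenBucket(1e9, 1e6), 9_000_000, duration_s=0.026)
spread = count_exceeded(TokenBucket(1e9, 1e6), 9_000_000, duration_s=1.0)
print(f"9 MB in 26 ms: {burst} packets over allowance; in 1 s: {spread}")
```

Both runs have the same per-second volume; only the spacing of the packets decides whether the allowance is exceeded.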
Increasing Limits
§ Add more network interfaces?
§ Change the instance type
• Scale up! 💸
• Use network-enhanced types 💸
• Use a newer generation
(7th gen increased limits)
§ EC2 instance bandwidth weighting: 25% more
network capacity (requires an 8th-gen instance)
§ Use jumbo frames inside AWS or via Direct Connect (DX)
Avoid Microbursts
§ AWS CLI S3: default.s3.max_bandwidth
§ SO_MAX_PACING_RATE socket option for your own
applications
§ Limit bandwidth using tc
§ Increase the Tx ring size with ethtool -G
Countermeasures
Throughput Limits
https://repost.aws/knowledge-center/ec2-instance-exceeding-network-limits
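SO_MAX_PACING_RATE caps the rate at which the kernel sends on a single socket, which smooths out bursts like the S3 upload in the microburst example. A minimal Python sketch, assuming Linux; the helper names are hypothetical, and the fallback constant 47 is the option's value in Linux's asm-generic socket header (some architectures use a different number). The rate is given in bytes per second; pacing is applied by the fq qdisc or, for TCP on recent kernels, by TCP's own pacing:

```python
import socket

# Python's socket module may not export this Linux-only constant; 47 is its value
# in asm-generic/socket.h (assumption: a mainstream architecture such as x86_64/arm64).
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)


def mbps_to_bytes_per_sec(mbps: float) -> int:
    """The socket option expects bytes per second, not bits per second."""
    return int(mbps * 1_000_000 / 8)


def paced_connection(host: str, port: int, max_mbps: float) -> socket.socket:
    """Open a TCP connection whose sending rate the kernel caps at max_mbps.

    Python passes the value as a C int, so this sketch supports rates below ~17 Gbps.
    """
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE,
                    mbps_to_bytes_per_sec(max_mbps))
    return sock
```

Capped at 500 Mbps, a 9 MB transfer takes about 144 ms instead of leaving the instance as a 26 ms burst.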
§ pps_allowance_exceeded increased,
while far away from the PPS limit
Amazonian: You write that
everything is fine but you have
packet loss metrics increasing.
EM: Yes, totally.
Amazonian: Fragmented packets?
EM: Oh, yes! How many can I send?
Amazonian: Can’t tell ya
EM: Hold my beer. Like 1000?
Amazonian: Exactly. And btw, you
won’t get more when you scale up.
AWS Support
Issues
Fragmented Packets
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-nitro-perf.html#ena-nitro-perf-exceptions
Packet
Gateway
UDP 1500 byte
§ Fragmented packets don’t use Nitro’s hardware acceleration
(fast path)
§ Limits differ between
• Ingress: Nitro standard rate (slow path)*
• Egress: 1024 pps (intentionally slow path)
§ Same limit (1024 pps) as link-local traffic, but a
separate bucket.
Understanding
Fragmented Packets
* Undocumented limit, but grows with instance size
§ pps_allowance_exceeded increased, while
far away from the PPS limit
§ Capture fragments with tcpdump 'ip[6] = 32'
§ Ask AWS support!
Monitoring
Fragmented Packets
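Byte 6 of the IPv4 header holds the three flag bits (reserved, DF, MF) and the top five bits of the 13-bit fragment offset. So 'ip[6] = 32' (0x20) matches packets with only MF set and an offset below 256 eight-byte units; for a packet split in two, that is exactly its first fragment, so each match counts one fragmented packet. The broader filter 'ip[6:2] & 0x3fff != 0' matches every fragment. The same tests in Python on raw header bytes (helper names are illustrative):

```python
import struct

MF_FLAG = 0x2000      # "More Fragments" bit of the 16-bit flags/offset field
OFFSET_MASK = 0x1FFF  # 13-bit fragment offset, in 8-byte units


def matches_ip6_eq_32(ip_header: bytes) -> bool:
    """Same test as the BPF filter 'ip[6] = 32': byte 6 is exactly 0x20."""
    return ip_header[6] == 0x20


def is_fragment(ip_header: bytes) -> bool:
    """Any fragment: MF set or non-zero offset (like 'ip[6:2] & 0x3fff != 0')."""
    (flags_offset,) = struct.unpack_from("!H", ip_header, 6)
    return bool(flags_offset & (MF_FLAG | OFFSET_MASK))


def header_with(flags_offset: int) -> bytes:
    """Hypothetical test helper: a 20-byte IPv4 header with only bytes 6-7 filled in."""
    header = bytearray(20)
    struct.pack_into("!H", header, 6, flags_offset)
    return bytes(header)


first = header_with(MF_FLAG)   # first fragment: MF set, offset 0
last = header_with(185)        # last fragment: offset 185 * 8 = 1480 bytes, MF clear
whole = header_with(0x4000)    # DF set, not fragmented
```

first matches both tests, last is only caught by the broader filter, and whole matches neither.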
Avoid Fragmentation
§ Don’t fragment!
§ VPN / encapsulation use cases
• Lower the MTU of tunnel interfaces, so the inner
packets are fragmented before encapsulation (hiding the fragments from AWS)
• Make sure path MTU discovery works
• TCP MSS clamping
Increase Throughput
§ Move fragmentation to the Transit Gateway (TGW), if applicable
§ Use ENA’s fragment bypass support
(v2.13.3, released last week)
• enable_frag_bypass
• Increases egress throughput from 1024 pps
to the “standard rate” (not accelerated)
• Competes for Nitro CPU resources
Countermeasures
Fragmented Packets
https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/RELEASENOTES.md#r2133-release-notes
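Lowering the tunnel MTU and clamping the TCP MSS both come down to subtracting the encapsulation overhead. A sketch assuming GTP-U over IPv4 (as used between packet gateways) with no GTP extension headers and no IP or TCP options; other tunnels (IPsec, WireGuard, VXLAN) have different overheads. On Linux, the resulting MSS is typically enforced with the iptables TCPMSS target:

```python
IPV4_HEADER = 20
UDP_HEADER = 8
TCP_HEADER = 20
GTPU_HEADER = 8  # mandatory GTP-U header, assuming no extension headers


def inner_mtu(outer_mtu: int, encapsulation_overhead: int) -> int:
    """Largest inner packet that fits into one outer packet without fragmentation."""
    return outer_mtu - encapsulation_overhead


def clamped_mss(mtu: int) -> int:
    """TCP MSS for IPv4 without options: MTU minus the IPv4 and TCP headers."""
    return mtu - IPV4_HEADER - TCP_HEADER


GTPU_OVER_IPV4 = IPV4_HEADER + UDP_HEADER + GTPU_HEADER  # 36 bytes
tunnel_mtu = inner_mtu(1500, GTPU_OVER_IPV4)
print(f"tunnel MTU: {tunnel_mtu}, MSS to clamp to: {clamped_mss(tunnel_mtu)}")
```

With these assumptions, inner packets of up to 1464 bytes and TCP segments of up to 1424 bytes fit into a 1500-byte outer packet.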
Conntrack
Connection Tracking
State of Connections
§ Stateful firewall:
• Allow the returning flow of a connection
• Similar to a NAT table, but slightly different
§ On Linux, mostly via iptables
Introduction
Connection Tracking
[Diagram: firewall with two NICs and its connection-tracking table]
Source IP   Source Port   Destination IP   Destination Port   Protocol   State
1.2.3.4     54321         5.6.7.8          80                 TCP        ESTABLISHED
1.2.3.5     54322         5.6.7.8          80                 TCP        SYN_SENT
1.2.3.5     43210         6.7.8.9          123                UDP        SEEN_REPLY
AWS
§ Security groups: conntrack is what makes them stateful
§ Limited number of entries
• Varies per EC2 instance type
• No official documentation of limits
Introduction
Connection Tracking
§ New TCP / UDP connections are blocked
• Incoming connections might fail
• DNS might (sometimes) not work
• AWS SSM might (sometimes) not work
Consequences
Connection Tracking
PGW
PGW
VPN
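To make these consequences concrete, a toy model in Python (hypothetical class names, not how Nitro implements conntrack): a fixed-size table keyed by the 5-tuple, where packets of known flows and their replies pass, but a new flow is refused once the table is full, until idle entries expire. The flows are the ones from the example table:

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str

    def reply(self) -> "FiveTuple":
        """The same connection seen in the return direction."""
        return FiveTuple(self.dst_ip, self.dst_port, self.src_ip, self.src_port, self.protocol)


class ConntrackTable:
    """Toy connection-tracking table with a fixed capacity and an idle timeout."""

    def __init__(self, capacity: int, idle_timeout_s: float) -> None:
        self.capacity = capacity
        self.idle_timeout_s = idle_timeout_s
        self.entries: dict[FiveTuple, float] = {}  # flow -> last time a packet was seen
        self.exceeded = 0  # plays the role of conntrack_allowance_exceeded

    def _expire(self, now: float) -> None:
        idle = [flow for flow, seen in self.entries.items() if now - seen > self.idle_timeout_s]
        for flow in idle:
            del self.entries[flow]

    def packet(self, flow: FiveTuple, now: float) -> bool:
        """Allow tracked flows (either direction); admit new flows only if there is room."""
        self._expire(now)
        for key in (flow, flow.reply()):
            if key in self.entries:
                self.entries[key] = now
                return True
        if len(self.entries) >= self.capacity:
            self.exceeded += 1
            return False  # table full: the new connection is refused
        self.entries[flow] = now
        return True

    @property
    def available(self) -> int:
        """Plays the role of conntrack_allowance_available."""
        return self.capacity - len(self.entries)


a = FiveTuple("1.2.3.4", 54321, "5.6.7.8", 80, "TCP")
b = FiveTuple("1.2.3.5", 54322, "5.6.7.8", 80, "TCP")
c = FiveTuple("1.2.3.5", 43210, "6.7.8.9", 123, "UDP")
```

With a capacity of two, a and b (and a's return traffic) keep working while c, for example a DNS query, is refused until the other entries time out.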
Monitoring
Connection Tracking
Capacity Exceeded
§ conntrack_allowance_exceeded
§ since 2021
Remaining Capacity
§ conntrack_allowance_available
§ since Jan 2023
Conntrack capacity by instance type and size:
Type   large   xlarge
c5     136k    273k
c5n    136k    273k
c6g    153k    307k    (+12% compared to previous generation)
c6gn   153k    307k
c7g    153k    307k
c7gn   205k    410k    (+33%)
c8g    153k    307k
c8gn   N/A     N/A     (?)
§ Requires a recent enough ENA driver version, not included in
Ubuntu 24.04 LTS; we compile our own.
§ The counters must be interpreted differently
• Exceeded counter: refused connections per interface
(use the sum of all interfaces)
• Available counter: every interface shows the same value,
the remaining capacity of the EC2 instance
(use the average or pick a single interface)
§ We observed different values for the available counter on
different interfaces; only the 1st interface was up to date
(bug #291, fixed)
§ Flow logs can help, but setup is cumbersome
Monitoring Challenges
Connection Tracking
https://github.com/amzn/amzn-drivers/issues/291
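The interpretation rules above as a small Python helper. The function name is hypothetical; the input is one dict of ENA counters per interface (for example parsed from ethtool -S on each interface), and primary names the interface to read the instance-wide value from:

```python
import warnings

EXCEEDED = "conntrack_allowance_exceeded"
AVAILABLE = "conntrack_allowance_available"


def aggregate_conntrack(per_interface: dict[str, dict[str, int]], primary: str) -> dict[str, int]:
    """Combine per-interface ENA conntrack counters into instance-level values.

    The exceeded counter is per interface, so it is summed. The available counter
    already describes the whole instance, so it is read from one interface only.
    """
    exceeded = sum(stats.get(EXCEEDED, 0) for stats in per_interface.values())
    available_values = {stats.get(AVAILABLE) for stats in per_interface.values()}
    if len(available_values) > 1:
        # Affected ENA driver versions (bug #291) only kept the first interface up to date.
        warnings.warn(f"interfaces disagree on {AVAILABLE}: {sorted(map(str, available_values))}")
    return {EXCEEDED: exceeded, AVAILABLE: per_interface[primary][AVAILABLE]}
```

Summing the available counter instead would overstate the remaining capacity by a factor of the number of interfaces.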
Countermeasures
Connection Tracking
Disable connection tracking
§ Have a rule in the reverse direction allowing
traffic from all IP addresses¹:
§ Network functions, e.g., a NAT instance
§ Beware of the security implications!
¹ Except "automatically tracked" connections
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-connection-tracking.html
Tracked (ingress limited to port 80):
Ingress Rules: Local Port 80, Source IP 0.0.0.0/0
Egress Rules: Destination IP 0.0.0.0/0, Remote Port *
Untracked (all traffic allowed in both directions):
Ingress Rules: Local Port *, Source IP 0.0.0.0/0
Egress Rules: Destination IP 0.0.0.0/0, Remote Port *
Automatically Tracked Connections
https://web.archive.org/web/20240521040710/https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-connection-tracking.html
Countermeasures
Connection Tracking
Live with it
§ Change EC2 instance
• Scale up the instance size, as the limit doubles 💸
• Change the instance type (cf. the table above)
§ Lower the idle timeouts of the network interface
§ Lower the cardinality (easier said than done)
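A back-of-the-envelope model for the last two bullets (an assumption, not an AWS formula): by Little's law, the number of tracked entries is roughly the rate of new flows times how long each entry lives, i.e. the flow's duration plus the idle timeout. The workload below is hypothetical; compare the results with the roughly 153k entries listed for c7g.large above:

```python
def expected_entries(new_flows_per_s: float, flow_duration_s: float,
                     idle_timeout_s: float) -> float:
    """Little's law: average entries = arrival rate x time each entry stays in the table.

    An entry lives for the flow's active duration plus the idle timeout after its last packet.
    """
    return new_flows_per_s * (flow_duration_s + idle_timeout_s)


# Hypothetical workload: 2,000 new short-lived UDP flows per second (e.g., DNS lookups).
for timeout_s in (180, 60, 30):
    entries = expected_entries(2_000, flow_duration_s=0.1, idle_timeout_s=timeout_s)
    print(f"idle timeout {timeout_s:>3} s -> about {entries:,.0f} tracked entries")
```

For short flows the idle timeout dominates, so lowering it shrinks the table almost proportionally, while lowering cardinality reduces the rate of new flows.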
§ The cloud has limits to protect against noisy
neighbors
• It is often not clear what the limits are
• AWS documentation got a lot better (we
should read it more often!)
§ Monitoring capabilities are helpful
• We can’t rescue every packet
• Sometimes still a mystery
§ Other clouds?
Summary
§ This is My Architecture: emnify
§ Evaluation SIM Cards
§ AWS re:Invent 2024 - EC2 Nitro networking
under the hood (NET402)
§ EC2 User Guide – Instance network
bandwidth
§ AWS Blogs - Amazon EC2 instance-level
network performance metrics uncover new
insights
§ AWS re:post - Why does my Amazon EC2
instance exceed its network limits when
average utilization is low?
Resources
