Learning from failures
Yoshinobu ‘maz’ Matsuzaki
<maz@iij.ad.jp>
bdNOG12 maz@iij.ad.jp 1
Reliability is becoming more important
• More use of the Internet
• COVID-19 has been pushing digitalization
• Bandwidth is one key
• When congestion occurs, the user experience gets worse
• But sufficient bandwidth alone is not enough
• Even a misconfigured network can still be used somehow, so problems stay hidden
• Reasonability, stability, and resiliency are the other key
Risk prediction training
1. Understanding the situation
• Discuss the hazard scenarios that can be imagined in the given situation
2. Determining risks
• Identify the hazards that need to be addressed
3. Establishing countermeasures
• Discuss possible measures to address the hazards
4. Setting goals
• Select which of the possible measures to implement
Example 1: Routing
• An ISP assigns a /24 to a customer
• The ISP sets up a static route for the /24 over the link to the customer
• The customer sets up a default route to the uplink
• The customer uses only a /28 out of the /24
[Figure: ISP and customer routers. Labels: 10.0.0.0/24, 10.0.0.0/28, static route, static route, default]
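Example 1: Risks
• A packet destined for an address inside the /24 but outside the /28 loops: the customer router has no specific route for it, so it follows the default route back to the ISP, which forwards it to the customer again
• If the customer's LAN-side interface goes down, the connected /28 route disappears and all packets destined for the /24 loop
• Result: a routing loop!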
[Figure: ISP and customer routers with a packet to 10.0.0.99. Labels: 10.0.0.0/24, 10.0.0.0/28, static route, static route, default]
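The loop can be reproduced with a small longest-prefix-match simulation. This is only a sketch under assumptions: the node names ("isp", "customer", "lan"), the `ttl` limit, and the helper functions are hypothetical and not from the slides; only the prefixes and route types come from the figure.

```python
import ipaddress

def make_tables(lan_up=True):
    """Routing tables modeled on the slide's figure (node names are hypothetical)."""
    customer = [("0.0.0.0/0", "isp")]            # default route to the uplink
    if lan_up:
        customer.append(("10.0.0.0/28", "lan"))  # connected LAN, present only while the interface is up
    isp = [("10.0.0.0/24", "customer")]          # static route for the assigned /24
    return {
        name: [(ipaddress.ip_network(prefix), hop) for prefix, hop in routes]
        for name, routes in {"isp": isp, "customer": customer}.items()
    }

def next_hop(table, dst):
    """Longest-prefix match; returns None when no route matches."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def trace(dst, tables, start="isp", ttl=4):
    """Forward hop by hop until the packet leaves the routers or the TTL runs out."""
    path, node = [start], start
    for _ in range(ttl):
        hop = next_hop(tables[node], dst)
        if hop is None:
            return path + ["no-route"]
        path.append(hop)
        if hop not in tables:      # delivered to a non-router such as "lan"
            return path
        node = hop
    return path + ["ttl-expired"]

if __name__ == "__main__":
    print(trace("10.0.0.5", make_tables()))               # address inside the /28
    print(trace("10.0.0.99", make_tables()))              # inside the /24, outside the /28
    print(trace("10.0.0.5", make_tables(lan_up=False)))   # LAN-side interface down
```

In this model the first trace ends at "lan", while the other two bounce between "isp" and "customer" until the TTL runs out, matching the two risks listed on the slide.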
Example 1: Measures
• Implement dynamic routing between the ISP and the customer
• Configure a static route on the customer's router that directs the same /24 to null
[Figure: ISP and customer routers. Labels: 10.0.0.0/24, 10.0.0.0/28, static route, static route, default]
Example 1: Adopted measure
• Configure a static route on the customer-side router that directs the same /24 to null
[Figure: ISP and customer routers, with a static null route for 10.0.0.0/24 added on the customer router. Labels: 10.0.0.0/24, 10.0.0.0/28, static route, static route, default]
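Why the null route stops the loop can be seen from the customer router's lookup alone. The following is a minimal sketch, assuming a simple longest-prefix-match model; the next-hop names ("lan", "null", "isp") and both functions are hypothetical and chosen only for illustration.

```python
import ipaddress

def customer_table(lan_up=True):
    """Customer routing table after adding the static null route for the /24."""
    routes = [
        ("10.0.0.0/24", "null"),   # static null route covering the whole assignment
        ("0.0.0.0/0", "isp"),      # default route to the uplink
    ]
    if lan_up:
        routes.append(("10.0.0.0/28", "lan"))  # connected LAN, only while the interface is up
    return [(ipaddress.ip_network(prefix), hop) for prefix, hop in routes]

def next_hop(table, dst):
    """Longest-prefix match; the default route guarantees at least one match."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in table if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

if __name__ == "__main__":
    table = customer_table()
    for dst in ("10.0.0.5", "10.0.0.99", "192.0.2.1"):
        print(dst, "->", next_hop(table, dst))
    # With the LAN-side interface down, the /24 null route still catches the /28.
    print("10.0.0.5 (LAN down) ->", next_hop(customer_table(lan_up=False), "10.0.0.5"))
```

Because the /24 null route is more specific than the default route but less specific than the connected /28, live LAN addresses are still delivered, while unused addresses are discarded at the customer router instead of being sent back to the ISP.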
Example 2: Port assignments
• Task: remove a cable from port X
• Just to be safe, make sure the port's LED is off before pulling the cable out
• But can you identify the right port with certainty?
[Figure: six port/LED layouts, numbered 1 to 6. Captions: "Straightforward"; "Starting from port 0"; "The left LED is for LC status"; "More efficient but confusable"; "A little clearer"; "Port 21 is the SFP now"]
And more...
• We may see a different implementation in the future
• Assumptions are the source of accidents!
• Different products have different port/LED assignments
• These differences cause confusion
The more you know, the more you can see
• A variety of experience helps us better consider hazards and identify risks
• Technical education and proper training are necessary to improve operational skills
• bdNOG workshops and tutorials are helpful
• There is always a need for appropriate educational materials
Mistakes!
• Mistakes can be a very good teaching tool
• There is a lot to learn from the mistakes in case studies
• Some cases are special, but many failures are common, and comparing them to your own situation yields lessons
• But as a business, we need to stop repeating failures in our service facilities
• Repeated failures damage reliability
Build a database of mistakes
• It can be a great teaching tool for engineers!
• It helps them avoid repeating similar mistakes
• You may find common and frequent mistakes
• If you can find the root cause of a failure, you can come up with a more effective solution
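A database of mistakes can start as a list of structured records. The sketch below is one possible shape, not a prescribed schema: the `Mistake` fields and the `frequent_root_causes` helper are assumptions, and the sample records are hypothetical entries loosely based on the two examples in this talk.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Mistake:
    summary: str      # what happened
    root_cause: str   # why it happened, once investigated
    effect: str       # what the failure caused

def frequent_root_causes(records, top=3):
    """Count records per root cause and return the most common ones first."""
    return Counter(r.root_cause for r in records).most_common(top)

# Hypothetical sample records for illustration only.
RECORDS = [
    Mistake("Packets to unused addresses looped", "no null route for the assigned /24", "link congestion"),
    Mistake("Wrong cable pulled on product A", "assumed port/LED layout", "customer outage"),
    Mistake("Wrong cable pulled on product B", "assumed port/LED layout", "customer outage"),
]

if __name__ == "__main__":
    for cause, count in frequent_root_causes(RECORDS):
        print(f"{count} x {cause}")
```

Grouping by root cause rather than by symptom is what surfaces the common and frequent mistakes the slide mentions.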
Mistake trend analysis
• Identify the high-impact mistakes
• Minimize the bad effects
• Reduce mistakes
[Figure: chart with axes "effects of mistakes" and "frequency of mistakes". Region labels: "should not happen", "problems", "matters", "problems"]
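One way to read the chart is to place each kind of mistake by its frequency and its effect, then pick actions accordingly. This is a rough sketch of that reading, not the author's method: the numeric scales, both thresholds, and the mapping from chart position to action are all assumptions.

```python
def suggested_actions(frequency, effect, freq_threshold=5, effect_threshold=3):
    """Map a mistake's position on the chart to the actions named on the slide.

    frequency: occurrences in the review period (hypothetical unit)
    effect: severity score, higher is worse (hypothetical scale)
    """
    actions = []
    if effect >= effect_threshold:
        actions.append("minimize the bad effects")   # high-impact mistakes
    if frequency >= freq_threshold:
        actions.append("reduce mistakes")            # frequent mistakes
    return actions

if __name__ == "__main__":
    print(suggested_actions(frequency=1, effect=5))
    print(suggested_actions(frequency=10, effect=1))
    print(suggested_actions(frequency=10, effect=5))
```

A mistake that is both frequent and high-impact collects both actions, which is the region the chart marks as one that should not happen.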
Accident investigation committee
• In some industries, accident investigation committees conduct detailed investigations and compile reports to prevent serious accidents from recurring
• Maybe bdNOG can do this as a community activity
• For the healthy development of the Internet in Bangladesh
• Regular reports of accident cases during bdNOG meetings
Summary
• To have a reliable network, we need to continuously improve our operations
• The use of failure cases allows for more effective risk analysis and countermeasures
• For the bdNOG community, I believe the following are worth considering:
• Collection of failure and mistake cases
• Trials of accident analysis

