Network State Awareness and Troubleshooting
Faraz Shamim (Technical Leader), Aamer Akhter (Director Product Management)
Cisco Systems Inc
• Troubleshooting Methodology
• Packet Forwarding Review
• Control Plane
• Active Monitoring
• Logging
• Routing Protocol Stability
• Data Plane
• Active Monitoring
• Passive Flow Monitoring
• QoS
• Getting Started
Agenda
• This session is about basic network troubleshooting,
focusing on fault detection & isolation
• Mostly, vendor neutral
• For context, we will cover some basic methodologies
and functional elements of network behavior
• This session is NOT about
• Architectures of specific platforms
• Data Center technologies
• Routing Protocols Troubleshooting
• This is a 90 min tour. ;-)
Keeping Focused: What This Session is About
The Big Picture
[Figure: a network operator and an application operator sit between a client (not happy) and a server across the network. Speech bubbles: "It’s not the network", "It’s the network", "Is it Monday?", "Pings fine!", "Can’t ping it.", "Internet’s down.", "Somebody's downloading something. (?)"]
• A lot of stuff going on
• Multiple networks
• Multiple applications
• Multiple layered services
• Mis-information / inconsistency
Some More (network) Detail
[Figure: a client (not happy) on the LAN, the enterprise WAN, the enterprise DC with Server A, ISP A, and the Internet with Server B; supporting services shown: DNS, DHCP, 802.1x]
• Redundant paths / ECMP / LAG
• Overlays
• Load balancers
• Firewalls
• NATs
… and it keeps on going
[Figure: the same topology (client, LAN, enterprise WAN, enterprise DC with Server A, Internet with Server B, DNS, DHCP, 802.1x) with ISP B added alongside ISP A]
Why network state awareness?
• What is it:
• View of network, what it is doing, and why
• Monitoring of data network performance,
in comparison with previous working states
• Quick detection of hard failures
• Early warning for
• soft failures
• performance issues
• and tomorrow’s problems
• Faster problem resolution
• Greater confidence in network by users and application operators
Find the Suspects Question Suspects Improve
Be Prepared
Think Like a Network Detective
• Control Plane
• Processes variety of information
sources and policies, creates
routing information base (RIB)
• Best known intention w/o actual
packet in hand
• Data Plane
• The actual forwarding process
(might be SW or HW based)
• Granted some decision flexibility
• Driven by arriving packet details,
traffic conditions etc.
Control Plane & Data Plane
[Figure: the control plane takes input from routing protocol(s), APIs, statics, admin edict and gossip from other routers; the data plane forwards a packet between Int A, Int B and Int C. Checks shown: check routes, check L3 routing, check policy, check forwarding, check policy-map int…, check interface, check flow monitor. Passive measurements: ifmib, *Flow, CbQoS. Also shown: PfR.]
• Control plane condenses options, driven by policies and (relatively) slower-moving, aggregated information, e.g. prefix reachability, interface state
• Data plane responds to packet conditions
• Destination prefix to egress interface matching
• Multi-path (ECMP / LAG) member selection
• Interface congestion
• QoS class state
• Access Lists
• Packet processing fields (TTL expire, etc)
• IPv4 fragmentation, etc
Data Plane Decision Flexibility
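The multi-path member selection listed above can be illustrated with a minimal sketch. It assumes a CRC32 hash over the flow 5-tuple purely for illustration; real devices use vendor-specific hash inputs and functions, and the link names and addresses below are hypothetical.

```python
import zlib

def pick_member(flow, members):
    """Map a flow 5-tuple onto one member of an ECMP/LAG group."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

# Hypothetical two-member group and two flows that differ only in source port.
members = ["link-1", "link-2"]
flow_a = ("10.0.0.1", "192.0.2.9", 6, 40000, 443)
flow_b = ("10.0.0.1", "192.0.2.9", 6, 40001, 443)

for flow in (flow_a, flow_b):
    # The same flow always lands on the same member; different flows may not.
    print(flow, "->", pick_member(flow, members))
```

The point of the sketch is that the choice depends on packet header fields, so a probe with different ports than the user's traffic may take a different path.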
• Each network device makes an independent forwarding decision
• Explicit Local / domain policies
• Device perspective might not be symmetric
• Data plane flexibility
• Generally happens at WAN-edge and admin boundaries (traffic engineering)
• Asymmetric routing
Network as a System: Independent Decisions
[Figure: hosts A and B connected through routers R1 to R6, split between "your network" and a part you don’t control. Callouts: "Congested link", "R5 is doing ECMP hash".]
• Change is normal, but some
changes are more interesting:
• Single change that causes loss
of reachability or suboptimal
performance
• Instability: high rate of change
• 3Ws: when, where, and what
Data Plane and Control Plane Changes
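One way to turn "instability: high rate of change" into something measurable is a sliding-window count of change events. This is a minimal sketch under stated assumptions: the event timestamps (in seconds), the window and the threshold are all hypothetical values to be tuned against what is normal for your network.

```python
def is_unstable(event_times, window, max_events):
    """Return True if more than max_events changes fall within any window."""
    times = sorted(event_times)
    start = 0
    for end, t in enumerate(times):
        # Shrink the window from the left until it spans at most `window`.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > max_events:
            return True
    return False

# Hypothetical change timestamps, e.g. taken from syslog (the "when").
flapping = [0, 10, 20, 25, 28]
quiet = [0, 100, 200, 300]
print(is_unstable(flapping, window=30, max_events=3))
print(is_unstable(quiet, window=30, max_events=3))
```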
Control Plane
3Ws: when, where, and what
BRKARC-2025
What do I have?
• Establish inventory baseline
• Device names, IPs, configuration
• Modular HW configuration
• Serial # (for support & replacement)
• History (where has it been placed)
• Clearly label devices, ownership
and contact info
• Establish standards for location,
device/port names
• Check for changes periodically
(tooling)
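The periodic change check can be as simple as comparing two inventory snapshots. The sketch below assumes each snapshot is a dictionary keyed by device name; the field names and values are hypothetical and would come from whatever tooling collects your inventory.

```python
def diff_inventory(baseline, current):
    """Compare two {device_name: {field: value}} snapshots."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = {}
    for name in sorted(set(baseline) & set(current)):
        fields = set(baseline[name]) | set(current[name])
        delta = {f: (baseline[name].get(f), current[name].get(f))
                 for f in sorted(fields)
                 if baseline[name].get(f) != current[name].get(f)}
        if delta:
            changed[name] = delta
    return added, removed, changed

# Hypothetical snapshots: SW1 disappeared, R2 appeared, R1 was swapped.
baseline = {"R1": {"ip": "10.0.0.1", "serial": "ABC123"},
            "SW1": {"ip": "10.0.0.10", "serial": "DEF456"}}
current = {"R1": {"ip": "10.0.0.1", "serial": "XYZ789"},
           "R2": {"ip": "10.0.0.2", "serial": "GHI012"}}
added, removed, changed = diff_inventory(baseline, current)
print(added, removed, changed)
```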
Example device label:
<owner/dept>
<device-name>
<IP address>
<Contact>
Example cable label:
<current-location> to <destination-location>
<circuit src/dst id>
How is it wired together?
• Establish network topology baseline
• Be prepared to be surprised!
• L2 protocol for discovery (LLDP?)
• Cisco, Foundry, Nortel
• Visual inspection :-)
[Figure: R1, R2 and SW1]
Tools for Topology & Inventory Management
• Most NMS tools have some element of inventory and topology awareness
• NetBrain
• (open source) NetDisco
http://www.netdisco.org
• (open source) Netdot
https://osl.uoregon.edu/redmine/projects/netdot
Logging
• Centrally: for ease of analysis and search
• Moogsoft - automates early detection of service failures, collaboration & knowledge base
• syslog-ng – preprocessing, relay and store (file/db)
• Logstash (ELK), fluentd – multisource collection, storage and analysis
• Locally: in case logs can’t get home
State of the Routing Table
• Be familiar with normal behavior of important service prefixes
• Establish quickly if problem is control plane or data plane
• Check routing table / ipRouteTable MIB / check ip traffic (drop stats)
• Track objects
#show ip route 192.168.2.2
Routing entry for 192.168.2.2/32
Known via "ospf 1", distance 110, metric 11, type intra area
Last update from 10.0.0.2 on FastEthernet0/0, 00:00:13 ago
Routing Descriptor Blocks:
* 10.0.0.2, from 2.2.2.2, 00:00:13 ago, via FastEthernet0/0
Route metric is 11, traffic share count is 1
• Remember that OSPF data within an area should be consistent
• Understand the ‘normal’ rate of change
• LSAs refresh every 30 min unless there is a change
• Track SPF runs over time
• Number of LSAs expected
• OSPF-MIB: ospfSpfRuns, ospfAreaLsaCount
• Route missing?
• Where is the network supposed to be attached? Is it still?
• check interface (on advertising router)
• Check ospf database …
OSPF Area / AS-Wide
# show ip ospf
Routing Process "ospf 1" with ID 192.168.0.1
Start time: 00:01:46.195, Time elapsed: 00:48:27.308
Supports only single TOS(TOS0) routes
Supports opaque LSA
Supports Link-local Signaling (LLS)
Supports area transit capability
Supports NSSA (compatible with RFC 3101)
Supports Database Exchange Summary List Optimization (RFC 5243)
Event-log enabled, Maximum number of events: 1000, Mode: cyclic
Router is not originating router-LSAs with maximum metric
Initial SPF schedule delay 5000 msecs
Minimum hold time between two consecutive SPFs 10000 msecs
Maximum wait time between two consecutive SPFs 10000 msecs
Incremental-SPF disabled
Minimum LSA interval 5 secs
Minimum LSA arrival 1000 msecs
LSA group pacing timer 240 secs
Interface flood pacing timer 33 msecs
Retransmission pacing timer 66 msecs
Number of external LSA 0. Checksum Sum 0x000000
Number of opaque AS LSA 0. Checksum Sum 0x000000
Number of DCbitless external and opaque AS LSA 0
Number of DoNotAge external and opaque AS LSA 0
Number of areas in this router is 1. 1 normal 0 stub 0 nssa
Number of areas transit capable is 0
External flood list length 0
IETF NSF helper support enabled
Cisco NSF helper support enabled
Reference bandwidth unit is 100 mbps
Area BACKBONE(0)
Number of interfaces in this area is 4 (1 loopback)
Area has no authentication
SPF algorithm last executed 00:47:05.379 ago
SPF algorithm executed 4 times
Area ranges are
Number of LSA 16. Checksum Sum 0x078460
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
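To track SPF runs and LSA counts over time without SNMP, the two counters can be scraped from the output above. This is a minimal sketch that assumes the line formats shown on this slide and a single area; the function name is hypothetical, and other platforms or software versions may format the output differently.

```python
import re

def ospf_area_counters(text):
    """Extract the SPF run count and area LSA count from 'show ip ospf' text."""
    spf = re.search(r"SPF algorithm executed (\d+) times", text)
    lsa = re.search(r"Number of LSA (\d+)\.", text)
    return {"spf_runs": int(spf.group(1)) if spf else None,
            "lsa_count": int(lsa.group(1)) if lsa else None}

# A few lines copied from the output above.
sample = """\
 Number of external LSA 0. Checksum Sum 0x000000
 Area BACKBONE(0)
 SPF algorithm executed 4 times
 Number of LSA 16. Checksum Sum 0x078460
"""
counters = ospf_area_counters(sample)
print(counters)
```

Polling this periodically and storing the deltas gives a baseline for what a normal number of SPF runs per hour looks like.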
OSPF Neighborships
• Neighbor adjacencies
• Check ospf neighbor detail (OSPF-MIB: ospfNbrState, ospfNbrEvents, ospfNbrLsRetransQLen)
• How many state changes have occurred?
• What is the current state?
• Are any retransmissions happening?
• Check the interface queue
# show ip ospf neighbor detail
Neighbor 192.168.0.7, interface address 10.0.0.3
In the area 0 via interface GigabitEthernet0/1
Neighbor priority is 1, State is FULL, 6 state changes
DR is 10.0.0.3 BDR is 10.0.0.4
Options is 0x12 in Hello (E-bit, L-bit)
Options is 0x52 in DBD (E-bit, L-bit, O-bit)
LLS Options is 0x1 (LR)
Dead timer due in 00:00:39
Neighbor is up for 00:33:10
Index 2/2/2, retransmission queue length 0, number of retransmission 0
First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
Last retransmission scan length is 0, maximum is 0
Last retransmission scan time is 0 msec, maximum is 0 msec
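The three questions asked of a neighbor (current state, number of state changes, retransmissions) can be pulled from the output above. A minimal sketch, assuming the line formats shown here; the function name is hypothetical.

```python
import re

def ospf_neighbor_health(text):
    """Extract state, state-change count and retransmission counters."""
    state = re.search(r"State is ([\w/]+), (\d+) state changes", text)
    retrans = re.search(
        r"retransmission queue length (\d+), number of retransmission (\d+)",
        text)
    return {
        "state": state.group(1) if state else None,
        "state_changes": int(state.group(2)) if state else None,
        "retrans_queue_len": int(retrans.group(1)) if retrans else None,
        "retransmissions": int(retrans.group(2)) if retrans else None,
    }

# Two lines copied from the output above.
sample = """\
 Neighbor priority is 1, State is FULL, 6 state changes
 Index 2/2/2, retransmission queue length 0, number of retransmission 0
"""
health = ospf_neighbor_health(sample)
print(health)
```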
BGP Monitoring Protocol (BMP) Overview
Collecting Pre-Policy BGP Messages
[Figure: external BGP peers (eBGP updates) and an internal BGP peer (iBGP updates) feed the BMP client. Each peer's Adj-RIB-In (pre-inbound-filter) is sent as BMP messages (BGP Monitor Protocol updates) to the BMP collector; inbound filtering/policing then produces the Loc-RIB (post-inbound-filter).]
• IETF RFC 7854
• BMP client (router) provides a pre-policy view of the Adj-RIB-In of a peer
• Update messages from peer sent to BMP receiver
• Example uses:
• Realtime visualizer of BGP state
• Traffic engineering analytics
• BGP policy exploration
BGP Monitoring Protocol
OpenBMP
Historical record of prefix withdraws
Current route views and peer status
http://www.openbmp.org
Data Plane
3Ws: when, where, and what
User / Agent Checks
• Treat network as a black box: are your beacon services working?
• Synthetic service check (HTTP, DNS, etc.)
• Ping (not all remotes will respond)
• Data plane is exercised and tested
• Variety = better coverage (multiple IP addresses / L4 ports per location)
• Validate similar treatment (QoS) as real user traffic
• Uptime and performance (loss, latency) metrics
• Look for patterns, changes from normal. All down vs some down.
• Capture and validate real user (human) incidents. What got missed?
• Use wisely: network and server resources consumed
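A synthetic service check can start as small as a timed TCP connection. The sketch below is one possible shape, not a complete monitoring agent; the function name is hypothetical and it measures only connection setup, not application response.

```python
import socket
import time

def tcp_check(host, port, timeout=2.0):
    """Return (reachable, connect_time_ms) for one TCP connection attempt."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False, None
```

Calling `tcp_check(host, port)` against several addresses and L4 ports per location, on a schedule, gives the uptime and latency series the bullets above describe. Because it creates real connections, keep the probe rate modest.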
[Figure: path between A and B through routers R1, R2, R3, R5 and R6]
IP SLA* (RFC 6812): Synthetic Traffic Measurements
MIB data; active generated traffic to measure the network (source, destination, responder)
Uses: network performance monitoring, service level agreement (SLA) monitoring, network assessment, Multiprotocol Label Switching (MPLS) monitoring, VoIP monitoring, availability, troubleshooting
Measurement metrics: latency, packet loss, network jitter, distribution of stats, connectivity
Operations: Jitter, FTP, DNS, DHCP, DLSW, ICMP, UDP, TCP, HTTP, LDP, H.323, SIP, RTP
*IP SLA can be replaced with equivalent monitoring tools from other vendors, such as Juniper RPM
• IP SLA on router/switch – shadow router?
• User end-system based agent software
• Dedicated agent
traceroute
• Understand the limitations
• Sends 3 packets (default) at each TTL
• Implementations
• Linux/Cisco: UDP (ICMP and TCP SYN are optional on Linux)
• UDP DST port # is used to keep track of packets and increments per packet. Initial = 33434 (default)
• SRC port #: randomized (Linux), incrementing per packet (Cisco IOS)
• Linux (GNU inetutils-traceroute)
• UDP DST port# increments per TTL (not per packet)
• SRC port is random but fixed per entire run
• Windows: ICMP Echo request
Widest dispersion against possibilities. Difficult to understand though.
ICMP blocked frequently :-(
Narrower dispersion. Story might be misleading.
Internet: aka the TCP/80 network
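The two UDP port schemes described above can be compared side by side. This sketch assumes the base port 33434 is used for the first probe at TTL 1 in both schemes, which may differ by one between implementations; the function names are hypothetical.

```python
BASE_PORT = 33434  # default initial UDP destination port

def dst_port_per_packet(ttl, probe, probes_per_hop=3):
    """Destination port when it increments on every probe (Linux/Cisco style)."""
    return BASE_PORT + (ttl - 1) * probes_per_hop + probe

def dst_port_per_ttl(ttl):
    """Destination port when it increments once per TTL (inetutils style)."""
    return BASE_PORT + (ttl - 1)

for ttl in (1, 2):
    per_packet = [dst_port_per_packet(ttl, p) for p in range(3)]
    per_ttl = [dst_port_per_ttl(ttl) for _ in range(3)]
    print(ttl, per_packet, per_ttl)
```

With per-packet increments every probe presents a different 5-tuple to ECMP hashing, so the three probes at one TTL can reach three different routers. With per-TTL increments the three probes at a hop share a destination port, which is consistent with the narrower dispersion noted above.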
Unix traceroute
• Multiple path options
• Topology ‘shortcuts’ (same router seen at diff hop)
• Ultimately all paths result in similar e2e delay
$ traceroute 62.2.88.172
traceroute to 62.2.88.172 (62.2.88.172), 30 hops max, 60 byte packets
1 152.22.242.65 (152.22.242.65) 1.044 ms 1.371 ms 1.585 ms
2 152.22.240.8 (152.22.240.8) 0.219 ms 0.328 ms 0.327 ms
3 128.109.70.9 (128.109.70.9) 1.066 ms 1.059 ms 1.168 ms
4 rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 1.634 ms 1.628 ms 1.736 ms
5 rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 5.354 ms 5.446 ms 5.557 ms
6 128.109.9.117 (128.109.9.117) 5.671 ms 128.109.9.170 (128.109.9.170) 7.141 ms 128.109.9.117 (128.109.9.117) 5.433 ms
7 wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net (128.109.1.105) 9.174 ms 128.109.1.209 (128.109.1.209) 8.256 ms 6.397 ms
8 dcp-brdr-03.inet.qwest.net (205.171.251.110) 18.414 ms chr-edge-03.inet.qwest.net (65.114.0.205) 27.353 ms 27.438 ms
9 dcp-brdr-03.inet.qwest.net (205.171.251.110) 21.739 ms 63-235-40-106.dia.static.qwest.net (63.235.40.106) 17.750 ms
dcp-brdr-03.inet.qwest.net (205.171.251.110) 22.450 ms
10 63-235-40-106.dia.static.qwest.net (63.235.40.106) 22.531 ms 22.516 ms 84-116-130-173.aorta.net (84.116.130.173) 140.738 ms
11 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 140.831 ms 140.816 ms 84-116-130-173.aorta.net (84.116.130.173) 144.819 ms
12 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 144.074 ms 144.761 ms 84-116-130-58.aorta.net (84.116.130.58) 138.455 ms
13 84-116-130-58.aorta.net (84.116.130.58) 141.844 ms 141.924 ms 142.459 ms
14 84.116.204.234 (84.116.204.234) 145.603 ms 145.891 ms 145.987 ms
15 * * *
16 62-2-88-172.static.cablecom.ch (62.2.88.172) 268.281 ms 268.245 ms 268.176 ms
Responding router per probe (one letter per distinct router):
1 AAA
2 BBB
3 CCC
4 DDD
5 EEE
6 FGF
7 HII
8 JKK +10ms (unsustained)
9 JLJ
10 LLM +120ms (sustained): Atlantic crossing
11 NNM
12 NNO
13 PPP
14 QQQ
15 ***
16 RRR ~268ms (all three)
Hops 15-16: filter + > 100 ms delay
Unix inetutils traceroute
• Narrower view (no alternate paths directly seen)
• Repeating nodes suggests multipath, or (unlikely) routing issue
$ inetutils-traceroute --resolve-hostname 62.2.88.172
traceroute to 62.2.88.172 (62.2.88.172), 64 hops max
1 152.22.242.65 (152.22.242.65) 0.783ms 0.727ms 0.798ms
2 152.22.240.8 (152.22.240.8) 0.226ms 0.228ms 0.221ms
3 128.109.70.9 (128.109.70.9) 0.967ms 0.980ms 0.962ms
4 128.109.70.137 (rtp7600-gw-to-dep7600-gw2.ncren.net) 1.576ms 1.598ms 1.567ms
5 128.109.9.17 (rlasr-gw-link1-to-rtp7600-gw.ncren.net) 5.149ms 5.140ms 5.126ms
6 128.109.9.166 (128.109.9.166) 7.113ms 7.098ms 7.306ms
7 128.109.1.209 (128.109.1.209) 7.835ms 8.326ms 7.958ms
8 65.114.0.205 (chr-edge-03.inet.qwest.net) 19.944ms 9.299ms 40.372ms
9 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 18.442ms 18.412ms 18.432ms
10 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 22.424ms 22.391ms 75.960ms
11 84.116.130.173 (84-116-130-173.aorta.net) 145.434ms 146.301ms 145.445ms
12 84.116.130.58 (84-116-130-58.aorta.net) 137.583ms 137.556ms 137.661ms
13 84.116.130.58 (84-116-130-58.aorta.net) 142.476ms 141.886ms 141.819ms
14 84.116.204.234 (84.116.204.234) 144.841ms 145.034ms 144.964ms
15 * * *
16 62.2.88.172 (62-2-88-172.static.cablecom.ch) 287.318ms 176.670ms 254.237ms
Packets for hops 9 and 12 took a ‘shortcut’; packets for hops 10 and 13 went the long way.
LFT
• lft ‘layer 4 traceroute’ dynamically adjusts to responses
• Firewall detection, whois and AS lookup integrated
• Narrower packet changes, so narrower multi-path
$ sudo lft -ENA 62.2.88.172
Tracing ________________________________________________________________.
TTL LFT trace to 62-2-88-172.static.cablecom.ch (62.2.88.172):80/tcp
1 [AS81] [NCREN-B22] 152.22.242.65 20.1/17.2ms
2 [AS81] [NCREN-B22] 152.22.240.8 20.1/20.1ms
3 [AS81] [CONCERT] 128.109.70.9 20.1/20.1ms
4 [AS81] [CONCERT] rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 20.1/20.1ms
5 [AS81] [CONCERT] rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 20.1/20.1ms
6 [AS81] [CONCERT] 128.109.9.117 20.1/20.1ms
7 [AS209] [unknown] chr-edge-03.inet.qwest.net (65.121.156.209) 20.1/19.5ms
8 [AS209] [QWEST-INET-35] dcp-brdr-03.inet.qwest.net (205.171.251.110) 20.1/18.4ms
9 [AS209] [QWEST-INET-17] 63-235-40-106.dia.static.qwest.net (63.235.40.106) 20.1/60.3ms
10 [AS6830] [84-RIPE/LGI-Infrastructure] 84-116-130-173.aorta.net (84.116.130.173) 160.7/160.7ms
11 [AS6830] [84-RIPE/LGI-Infrastructure] nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 160.7/160.7ms
12 [AS6830] [84-RIPE/LGI-Infrastructure] 84-116-130-58.aorta.net (84.116.130.58) 140.6/140.6ms
** [firewall] the next gateway may statefully inspect packets
13 [AS6830] [84-RIPE/LGI-Infrastructure] 84.116.204.234 160.7/160.6ms
** [neglected] no reply packets received from TTL 14
15 * [AS6830] [RIPE-C3/CC-HO841-NET] [target] 62-2-88-172.static.cablecom.ch (62.2.88.172):80 160.7ms
Used tcp/80 SYN
MTR
• Interactive combined traceroute and ping
• Gives a sense of health of path (loss, delay Standard Deviation)
• Narrow path view
$ mtr 62.2.88.172
aakhter-nlr-ubuntu-01 (0.0.0.0) Sat May 30 18:57:09 2015
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 152.22.242.65 0.0% 145 0.8 0.9 0.7 10.0 0.8
2. 152.22.240.8 0.0% 145 0.3 0.2 0.2 0.3 0.0
3. 128.109.70.9 0.0% 145 1.0 3.3 1.0 182.3 17.2
4. rtp7600-gw-to-dep7600-gw2.ncren.net 1.0% 145 9.2 4.1 1.6 203.4 18.6
5. rlasr-gw-link1-to-rtp7600-gw.ncren.net 0.0% 145 5.3 5.3 5.1 6.8 0.2
6. 128.109.9.166 0.0% 145 7.1 7.3 7.1 16.1 0.8
7. wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net 0.0% 145 6.8 8.3 6.2 10.6 1.0
8. chr-edge-03.inet.qwest.net 0.0% 145 9.4 12.3 9.3 62.1 9.5
9. dcp-brdr-03.inet.qwest.net 0.0% 145 21.8 22.8 21.7 70.7 5.5
10. 63-235-40-106.dia.static.qwest.net 0.0% 145 21.8 24.5 21.7 86.1 10.6
11. 84-116-130-173.aorta.net 0.0% 145 144.8 145.0 144.7 152.9 1.0
12. nl-ams02a-rd1-te0-2-0-2.aorta.net 0.0% 145 144.1 145.5 144.0 165.4 3.7
13. 84-116-130-58.aorta.net 5.0% 144 142.9 142.3 142.0 145.6 0.4
14. 84.116.204.234 5.0% 144 145.1 145.1 144.9 145.3 0.0
15. 217-168-62-150.static.cablecom.ch 5.0% 144 145.9 146.1 145.2 164.3 1.9
16. 62-2-88-172.static.cablecom.ch 5.0% 144 313.0 260.3 152.6 508.0 80.0
Note variability, probably just the end system.
Just local noise, no carry over to later hops.
Sustained loss. Likely something wrong 12->13, or on the way back.
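The reading rule used in these callouts (loss at one hop only matters if it carries over to every later hop) can be written down directly. A minimal sketch; the function name and the 0.5% noise floor are assumptions, and the loss values are the Loss% column from the mtr output above.

```python
def sustained_loss_start(loss_by_hop, min_loss=0.5):
    """Return the first hop from which loss persists through the last hop."""
    start = None
    for hop, loss in enumerate(loss_by_hop, start=1):
        if loss >= min_loss:
            if start is None:
                start = hop
        else:
            # Loss did not carry over, so earlier loss was local noise.
            start = None
    return start

# Loss% per hop, hops 1 to 16, from the mtr run above.
loss = [0, 0, 0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0, 5.0, 5.0, 5.0, 5.0]
print(sustained_loss_start(loss))
```

The 1% loss at hop 4 is discarded because hop 5 shows none, while the 5% loss starting at hop 13 persists to the destination.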
Check interface
• Classic command
• Check interface ‘up’ status
• Stability: check log event or check
routing table stability
• Monitor in/out bit/packet changes
# show interface
GigabitEthernet1 is up, line protocol is up
Hardware is CSR vNIC, address is 000c.291a.7f97 (bia 000c.291a.7f97)
Internet address is 192.168.225.130/24
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full Duplex, 1000Mbps, link type is auto, media type is RJ45
output flow-control is unsupported, input flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:05:35, output 00:09:58, output hang never
Last clearing of "show interface" counters never
Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
25349 packets input, 2381158 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
3958 packets output, 312408 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
56 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out
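Monitoring in/out bit changes means sampling a byte counter twice and converting the difference to a rate. The sketch below assumes a counter that wraps at 2^32, as the 32-bit IF-MIB octet counters do, and at most one wrap between samples; the function name and sample values are hypothetical.

```python
def rate_bps(bytes_then, bytes_now, seconds, counter_bits=32):
    """Bits per second between two samples of a wrapping byte counter."""
    delta = (bytes_now - bytes_then) % (2 ** counter_bits)
    return delta * 8 / seconds

# Normal case: 1000 bytes in 8 seconds.
print(rate_bps(1000, 2000, 8))
# Wrapped case: the counter rolled over between the two samples.
print(rate_bps(2 ** 32 - 100, 100, 1))
```

On fast interfaces a 32-bit counter can wrap more than once between polls, which this arithmetic cannot detect; shorter polling intervals or 64-bit counters avoid that.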
Follow the Flow with NetFlow (RFC 3954)
• Per-node: data plane observations and decisions captured
• Src/dst MAC/IP/port #s, DSCP values, in/out interfaces, etc.
• Network view: flows centrally analyzed by a NetFlow collector/analyzer
• Biggest value: strategically placed partial views (e.g. WAN edge)
[Figure: routers R1 to R6 between A and B exporting flow records to a NetFlow collector (LiveAction)]
• Developed and patented at Cisco
Systems in 1996
• NetFlow is the de facto standard for
acquiring IP operational data
• Standardized in IETF via IPFIX
(RFC 7011)
• Provides network and security
monitoring, network planning, traffic
analysis, and IP accounting
• Packet capture is like a wire tap
• NetFlow is like a phone bill
NetFlow (RFC 3954): What Is It?
Network World article: NetFlow Adoption on the Rise
http://www.networkworld.com/newsletters/nsm/2005/0314nsm1.html
Flexible NetFlow
Multiple Monitors with Unique Key Fields

Flow Monitor 1: Traffic Analysis Cache
Key fields (Packet 1):
Source IP            3.3.3.3
Destination IP       2.2.2.2
Source Port          23
Destination Port     22078
Layer 3 Protocol     TCP - 6
TOS Byte             0
Input Interface      Ethernet 0
Non-key fields: Packets, Bytes, Timestamps, Next Hop Address
Cache entry:
Src. IP   Dest. IP  Source Port  Dest. Port  Protocol  TOS  Input I/F  …  Pkts
3.3.3.3   2.2.2.2   23           22078       6         0    E0         …  1100

Flow Monitor 2: Security Analysis Cache
Key fields (Packet 1):
Source IP            3.3.3.3
Dest IP              2.2.2.2
Input Interface      Ethernet 0
SYN Flag             0
Non-key fields: Packets, Timestamps
Cache entry:
Source IP  Dest. IP  Input I/F  Flag  …  Pkts
3.3.3.3    2.2.2.2   E0         0     …  11000
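The effect of giving each monitor its own key fields can be shown with a small model. This is a conceptual sketch, not how any router implements its flow cache; the class, the field names and the second packet are hypothetical.

```python
from collections import defaultdict

class FlowMonitor:
    """Aggregate packets into cache entries keyed by the chosen key fields."""

    def __init__(self, key_fields):
        self.key_fields = tuple(key_fields)
        self.cache = defaultdict(lambda: {"packets": 0, "bytes": 0})

    def observe(self, packet):
        key = tuple(packet[f] for f in self.key_fields)
        entry = self.cache[key]
        entry["packets"] += 1          # non-key field: packet count
        entry["bytes"] += packet["length"]  # non-key field: byte count

traffic = FlowMonitor(["src", "dst", "sport", "dport", "proto", "tos", "input_if"])
security = FlowMonitor(["src", "dst", "input_if", "syn"])

# Two packets that differ only in source port.
base = {"src": "3.3.3.3", "dst": "2.2.2.2", "dport": 22078, "proto": 6,
        "tos": 0, "input_if": "Ethernet0", "syn": 0, "length": 100}
for sport in (23, 24):
    packet = dict(base, sport=sport)
    traffic.observe(packet)
    security.observe(packet)

print(len(traffic.cache), len(security.cache))
```

The traffic monitor sees two flows because the source port is one of its keys, while the security monitor folds both packets into a single entry.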
• Flexible NetFlow Forwarding Status field captures forwarding (and drop reason) for flow.
• Drop Count increments on any explicit drop by router
NetFlow Forwarding Status & Drop Count Fields
RFC 7270
Network nodes are able to discover and validate RTP, TCP and IP-CBR traffic on a hop-by-hop basis
À la carte metric (loss, latency, jitter, etc.) selections, applied to operator-selected sets of traffic
Allows for fault isolation and network span validation
Per-application thresholds and alerting.
Network Performance Monitor
Performance Monitor Information Elements
Media Monitoring
• RTP SSRC
• RTP Jitter (min/max/mean)
• Transport Counter (expected/loss)
• Media Counter (bytes/packets/rate)
• Media Event
• Collection interval
• TCP MSS
• TCP round-trip time
Application Response Time
• CND – Client Network Delay (min/max/sum)
• SND – Server Network Delay (min/max/sum)
• ND – Network Delay (min/max/sum)
• AD – Application Delay (min/max/sum)
• Total Response Time (min/max/sum)
• Total Transaction Time (min/max/sum)
• Number of New Connections
• Number of Late Responses
• Number of Responses by Response Time (7-bucket histogram)
• Number of Retransmissions
• Number of Transactions
• Client/Server Bytes
• Client/Server Packets
Other Metrics
• L3 counter (bytes/packets)
• Flow event
• Flow direction
• Client and server address
• Source and destination address
• Transport information
• Input and output interfaces
• L3 information (TTL, DSCP, TOS, etc.)
• Application information (from deep packet inspection tool)
• Monitoring class hierarchy
NetFlow QoS Analysis
[Figure: screenshots from Cisco Prime Infrastructure and LiveAction relating flow 5-tuple, DPI/NBAR, QoS processing and DSCP]
How is my flow being classified?
Did this QoS class drop traffic?
Dedicated Protocol Analyzers
• Wireshark and other protocol analyzers are great
• Detailed analysis for variety of protocols at deep level
• Dedicated probes are expensive to deploy pervasively
• Operator has to make difficult judgment calls on where the problem is going to be, before it happens
• Can be challenging after the fact: needs on-site trained personnel.
Embedded Packet Capture & Analyze
• Capture packets locally to buffer on router
• Store to flash, USB, FTP, TFTP for analysis in protocol analyzer
• Capture does not add traffic to network
LY-2851-8#monitor capture buffer pcap-buffer1 size 10000 max-size 1550
LY-2851-8#monitor capture point ip cef pcap-point1 g0/0 both
LY-2851-8#monitor capture point associate pcap-point1 pcap-buffer1
LY-2851-8#monitor capture point start pcap-point1
LY-2851-8#monitor capture point stop pcap-point1
LY-2851-8#monitor capture buffer pcap-buffer1 export ftp://10.17.0.252/images/test.cap
in-band OAM for IPv6 (iOAM6)
• New IPv6 extension header defined on user packets
• vs. IPv4 record-route option header
• RFC 2460 does not define an option to record the route
• Minimal performance hit (handled in data plane)
• Packets continue on regular path
• Instrumentation
• Packet sequence numbers => detect packet loss
• Time stamps => one way delay
• Node and ingress/egress interface names => path recording
• draft-brockners-inband-oam-requirements-03
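How sequence numbers and timestamps turn into loss and one-way delay can be sketched at the receiving end. This is an illustration only: the record layout is hypothetical rather than the header format from the draft, and one-way delay is only meaningful if sender and receiver clocks are synchronized.

```python
def loss_and_delay(records, first_seq, last_seq):
    """records: (sequence, tx_time_ms, rx_time_ms) for each packet received."""
    seen = {seq for seq, _, _ in records}
    lost = sorted(set(range(first_seq, last_seq + 1)) - seen)
    delays = [rx - tx for _, tx, rx in records]
    return lost, delays

# Hypothetical receiver log: sequence number 3 never arrived.
records = [(1, 1000, 1012), (2, 1010, 1025), (4, 1030, 1041)]
lost, delays = loss_and_delay(records, first_seq=1, last_seq=4)
print(lost, delays)
```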
[Figure: Enhanced Telemetry. Per hop and end-to-end data (Node-ID, ingress i/f, egress i/f, Sequence#, Timestamp, App-Data) is added into the packet of (selected) data traffic by the network element. Apps/controller uses: v6 traffic matrix, live flow tracing, delay distribution, bi-casting control, loss matrix/monitor, app data monitoring.]
iOAM6 Path Trace
• Extended Ping
H1#ping
Protocol [ip]: ipv6
Target IPv6 address: ::A:1:1:0:1D
Repeat count [5]: 1
Datagram size [100]: 300
Timeout in seconds [2]:
Extended commands? [no]: yes
Source address or interface: gig0/1
UDP protocol? [no]:
Verbose? [no]: yes
Precedence [0]:
DSCP [0]:
Include hop by hop Path Record option? [no]: yes
Sweep range of sizes? [no]:
Type escape sequence to abort.
Sending 1, 300-byte ICMP Echos to ::A:1:1:0:1D, timeout is 2 seconds:
(Gi0/1)R1(Gi0/2)----(Gi0/1)R4(Gi0/2)----(Gi0/2)R3(Gi0/3)----H3----(Gi0/3)R3(Gi0/2)----(Gi0/2)R4(Gi0/1)----(Gi0/2)R1(Gi0/1)
Reply to request 0 (35 ms)
Success rate is 100 percent (1/1), round-trip min/avg/max = 35/35/35 ms
[Figure: topology with H1, R1, R2, R3, R4 and H3 (::A:1:1:0:1D). Callouts: V6 extension header applied/decapped (at two points); end system ICMP stack iOAM6 enabled.]
Getting Started
Be Prepared!
• Be prepared and have data collection systems enabled
• Enable passive monitoring on endpoints and network
• Enable active tests
• Helpdesk
• Interview Script => establish & maintain checklists
• Multi-group access to tools, logs, etc.
• Firefighters run drills, so should your teams!
• Be familiar with the tools and how they respond on your network
• Red phone: Cross-domain teams (applications, UC, security, servers)
Expanding your Toolbox and Knowledge
• Commercial and open source tools to look at
• Network topology & IP address management: netdot, GestióIP
• Performance tests: iperf3, netperf
• Service checks: Nagios Core, Zenoss Core
• NetFlow / log analysis/monitoring: logstash, fluentd, splunk
• Template-driven config generation: ansible
• Control plane troubleshooting (Troubleshooting IP Routing Protocols)
Network Documentation Tool (netdot)
• Open source
• Started in 2002
• Network interfaces discovery via SNMP
• Discovery of L2 & L3 devices
• Dynamically draws a diagram/topology of your network
• Management of IPv4 and IPv6 address via IPAM
GestióIP
• Web-based IP management software
• Concurrent user support
• Better search and filter capabilities than a traditional spreadsheet
• Better statistical data
• Less chance of human error compared to a spreadsheet
• Tool-assisted migration from managed IPv4 to IPv6
iperf3
• Active measurement tool to discover available path capacity
• worst link and worst host configurations
• Test can be in either direction (only static NAT works)
• TCP (retransmissions, rate, cwnd), SCTP and UDP (loss, jitter, out of order) tests
[Figure: sender and receiver; TCP/5201; test traffic: TCP, SCTP, UDP]
$ bwctl -T iperf3 -t 30 -O 4 -s "56m-ps-4x10.sox.net:4823"
bwctl: Using tool: iperf3
bwctl: 40 seconds until test results available
SENDER START
Connecting to host 152.22.242.103, port 5160
[ 15] local 143.215.194.123 port 45609 connected to 152.22.242.103 port 5160
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 15] 0.00-1.00 sec 107 MBytes 898 Mbits/sec 0 3.06 MBytes (omitted)
[ 15] 1.00-2.00 sec 112 MBytes 944 Mbits/sec 0 3.06 MBytes (omitted)
…
[ 15] 29.00-30.00 sec 112 MBytes 944 Mbits/sec 0 3.06 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 15] 0.00-30.00 sec 3.29 GBytes 942 Mbits/sec 0 sender
[ 15] 0.00-30.00 sec 3.29 GBytes 943 Mbits/sec receiver
iperf Done.
SENDER END
iperf3 examples
$ bwctl -T iperf3 -t 30 -O 4 -c "56m-ps-4x10.sox.net:4823"
bwctl: Using tool: iperf3
bwctl: 39 seconds until test results available
SENDER START
Connecting to host 143.215.194.123, port 5327
[ 15] local 152.22.242.103 port 44855 connected to 143.215.194.123 port 5327
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 15] 0.00-1.00 sec 5.14 MBytes 43.1 Mbits/sec 411 25.5 KBytes (omitted)
[ 15] 1.00-2.00 sec 2.26 MBytes 19.0 Mbits/sec 15 19.8 KBytes (omitted)
…
[ 15] 28.00-29.00 sec 2.26 MBytes 18.9 Mbits/sec 16 25.5 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 15] 0.00-30.00 sec 59.8 MBytes 16.7 Mbits/sec 539 sender
[ 15] 0.00-30.00 sec 60.7 MBytes 17.0 Mbits/sec receiver
iperf Done.
SENDER END
Client to server (local to remote)
Throw away stats from first 4 sec
Run for 30 sec
~19 Mbps (local to remote)
Retransmissions
~940 Mbps (remote to local)
Use -P for parallel streams
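The summary lines can be sanity-checked with a little arithmetic. The sketch assumes that the byte units are binary (1 MByte = 2^20 bytes) and the bit units are decimal (1 Mbit = 10^6 bits), which is consistent with the figures in both outputs above; the function name is hypothetical.

```python
def mbits_per_sec(transferred, unit, seconds):
    """Convert a transfer total (binary byte units) to decimal Mbits/sec."""
    factor = {"MBytes": 2 ** 20, "GBytes": 2 ** 30}[unit]
    return transferred * factor * 8 / seconds / 1e6

# Remote to local: 3.29 GBytes in 30 s, reported as 942 Mbits/sec.
print(round(mbits_per_sec(3.29, "GBytes", 30)))
# Local to remote: 59.8 MBytes in 30 s, reported as 16.7 Mbits/sec.
print(round(mbits_per_sec(59.8, "MBytes", 30), 1))
```

The roughly 50-fold asymmetry between the two directions, together with the retransmission counts in the slow direction, is the finding worth chasing.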
• Similar to iperf3 but:
• Works bidirectionally in a NAT environment
• Additional connections-per-second and transactions-per-second tests
• Statistical confidence intervals (-I)
netperf
> netperf -t TCP_STREAM -H 162.209.79.211 -i 30,10 -I 95,5 -j -l 60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 162.209.79.211 ()
port 0 AF_INET : +/-2.500% @ 95% conf. : demo
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 8.965%
!!! Local CPU util : 0.000%
!!! Remote CPU util : 0.000%
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 60.52 13.91
download
Nagios Core
• Monitoring and alerting engine
• Open source written in C
• Flexible and scalable architecture
• Event scheduler, event processor and alert manager
• APIs can be used to extend the capabilities to perform additional tasks
• Designed for Unix & Linux systems
Zenoss Core
• Network Management System
• Open source
• Written in 90% Python, 10% Java
• GNU based license
• Web based interface for monitoring
• Used by government, retail, financial institutions, service providers and the tech industry
logstash
• Open source tool for managing events and logs
• Collects data/logs (many sources), filters and displays logs
• Scalable data processing
• Analysis, Archiving, Monitoring & Alerting
• Elasticsearch API is used for storage
• Kibana (a browser-based analytics tool) is used to view Logstash data
fluentd
• Open source data collector
• Written in C and Ruby
• Requires very few system resources
• Simple and flexible/extensible
• 5000+ companies are using fluentd for data collection
• Provides unified logging layer
• Decouples data sources from backend systems
ansible
• Automates software provisioning and configuration management
• Use for application deployment/migration
• Reduces complexity and repetition
• Ships with the Fedora distribution of Linux
• Used for Linux, Unix and Windows
• Written in Python and PowerShell
• Can also be used in cloud environments such as AWS and Azure
Splunk
• Accessible via standard browser or via mobile app
• Collects and indexes data
• Powerful statistical search of the data
• Correlate and investigate between events and activities
• Display reports in a customized dashboard
• Turns searches into real-time alerts
• Notifies via email or RSS
Alerting & Collaboration
• Routing of alerts / interesting events
• Is this noise or signal?
• Which team(s) to alert?
• Who is on duty?
• How to contact: SMS, IM, phone call…
• PagerDuty, OpenDuty
• Coordinating response
• IM tools (Spark, HipChat, etc.)
• Email
• Ticketing tools (OTRS, Jira, ServiceNow, Moogsoft…)
Thank You

Tutorial: Network State Awareness Troubleshooting

  • 1. Network State Awareness and Troubleshooting Faraz Shamim Aamer Akhter Technical Leader Director Product Management Cisco Systems Inc
  • 2. • Troubleshooting Methodology • Packet Forwarding Review • Control Plane • Active Monitoring • Logging • Routing Protocol Stability • Data Plane • Active Monitoring • Passive Flow Monitoring • QoS • Getting Started Agenda
  • 3. • This session is about basic network troubleshooting, focusing on fault detection & isolation • Mostly, vendor neutral • For context, we will cover some basic methodologies and functional elements of network behavior • This session is NOT about • Architectures of specific platforms • Data Center technologies • Routing Protocols Troubleshooting • This is a 90 min tour. ;-) Keeping Focused: What This Session is About 3
  • 4. The Big Picture network Network Operator Server Client Application Operator Not happy It’s not the network It’s the network Is it Monday? Pings fine! Can’t ping it. Internet’s down. Somebody's downloading something. (?) 4
  • 5. Enterprise DC • A lot of stuff going on • Multiple networks • Multiple applications • Multiple layered services • Mis-information / inconsistency Some More (network) Detail LAN Server A Client Not happy ISP A Enterprise WAN Server B Internet DNS DHCP 802.1x DNS 5
  • 6. ISP B Enterprise DC • Redundant paths / ECMP / LAG • Overlays • Load balancers • Firewalls • NATs … and it keeps on going LAN Server A Client Not happy ISP A Enterprise WAN Server B Internet DNS DHCP 802.1x DNS 6
  • 7. Why network state awareness? • What is it: • View of network, what it is doing, and why • Monitoring of data network performance, in comparison with previous working states • Quick detection of hard failures • Early warning for • soft failures • performance issues • and tomorrows’ problems • Faster problem resolution • Greater confidence in network by users and application operators 7
  • 8. Find the Suspects Question Suspects Improve Be Prepared Think Like a Network Detective 8
  • 9. • Control Plane • Processes variety of information sources and policies, creates routing information base (RIB) • Best known intention w/o actual packet in hand • Data Plane • The actual forwarding process (might be SW or HW based) • Granted some decision flexibility • Driven by arriving packet details, traffic conditions etc. Control Plane & Data Plane Control Plane Data Plane Int A Int B Int C packet Routing Protocol(s) APIs Statics Check routes check L3 routing Check policy check forwarding Gossip from other routers Passive Measurements ifmib *FlowCbQoS check policy-map int… check interface check flow monitor PfR 9 Admin Edict
  • 10. • Control plane: condenses options driven by policies and (relatively) slower moving , aggregated information, eg. prefix reachability, interface state • Data plane responds to packet conditions • Destination prefix to egress interface matching • Multi-path (ECMP / LAG) member selection • Interface congestion • QoS class state • Access Lists • Packet processing fields (TTL expire, etc) • IPv4 fragmentation, etc Data Plane Decision Flexibility 10
  • 11. • Each network device makes an independent forwarding decision • Explicit Local / domain policies • Device perspective might not be symmetric • Data plane flexibility • Generally happens at WAN-edge and admin boundaries (traffic engineering) • Asymmetric routing Network as a System: Independent Decisions A B R1 R2 R5 R6 R4 R3 your network You don’t control Congested link R5 is doing ECMP hash 11
  • 12. • Change is normal, but some changes are more interesting: • Single change that causes loss of reachability or suboptimal performance • Instability: high rate of change • 3Ws: when, where, and what Data Plane and Control Plane Changes
  • 13. Control Plane 3Ws: when, where, and what 13
  • 14. 14BRKARC-2025 What do I have? • Establish inventory baseline • Device names, IPs, configuration • Modular HW configuration • Serial # (for support & replacement) • History (where has it been placed) • Clearly label devices, ownership and contact info • Establish standards for location, device/port names • Check for changes periodically (tooling) <owner/dept> <device-name> <IP address> <Contact> <current-location> to <destination-location> <circuit src/dst id> Example device label Example cable label
  • 15. How is it wired together? • Establish network topology baseline • Be prepared to be surprised! • L2 protocol for discovery (LLDP?) • Cisco, Foundry, Nortel • Visual inspection J R1 R2SW1
  • 16. 16 Tools for Topology & Inventory Management • Most NMS tools have some element of inventory and topology awareness • NetBrain • (open source) NetDisco http://www.netdisco.org • (open source) Netdot https://osl.uoregon.edu/redmine/projects/netdot
  • 17. Logging • Centrally: for ease of analysis and search • Moogsoft - automates early detection of service failures, collaboration & knowledge base • syslog-ng – preprocessing, relay and store(file/db) • Logstash(ELK), fluentd – multisource collection, storage and analysis • Locally: in case logs can’t get home 17
  • 18. State of the Routing Table • Be familiar with normal behavior of important service prefixes • Establish quickly if problem is control plane or data plane • Check routing table/ ipRouteTable MIB / check ip traffic (Drop stats) • Track objects 18 #show ip route 192.168.2.2 Routing entry for 192.168.2.2/32 Known via "ospf 1", distance 110, metric 11, type intra area Last update from 10.0.0.2 on FastEthernet0/0, 00:00:13 ago Routing Descriptor Blocks: * 10.0.0.2, from 2.2.2.2, 00:00:13 ago, via FastEthernet0/0 Route metric is 11, traffic share count is 1
  • 19. • Remember that OSPF data in area should be consistent • Understand ‘normal’ rate of changes • LSA refresh /30-min unless a change • Track SPF runs over time • number of LSAs expected • OSPF-MIB: OspfSpfRuns, ospfAreaLSACount • Route missing? • Where is the network supposed to be attached? Is it still? • check interface (on advertising router) • Check ospf database … OSPF Area / AS-Wide # show ip ospf Routing Process "ospf 1" with ID 192.168.0.1 Start time: 00:01:46.195, Time elapsed: 00:48:27.308 Supports only single TOS(TOS0) routes Supports opaque LSA Supports Link-local Signaling (LLS) Supports area transit capability Supports NSSA (compatible with RFC 3101) Supports Database Exchange Summary List Optimization (RFC 5243) Event-log enabled, Maximum number of events: 1000, Mode: cyclic Router is not originating router-LSAs with maximum metric Initial SPF schedule delay 5000 msecs Minimum hold time between two consecutive SPFs 10000 msecs Maximum wait time between two consecutive SPFs 10000 msecs Incremental-SPF disabled Minimum LSA interval 5 secs Minimum LSA arrival 1000 msecs LSA group pacing timer 240 secs Interface flood pacing timer 33 msecs Retransmission pacing timer 66 msecs Number of external LSA 0. Checksum Sum 0x000000 Number of opaque AS LSA 0. Checksum Sum 0x000000 Number of DCbitless external and opaque AS LSA 0 Number of DoNotAge external and opaque AS LSA 0 Number of areas in this router is 1. 1 normal 0 stub 0 nssa Number of areas transit capable is 0 External flood list length 0 IETF NSF helper support enabled Cisco NSF helper support enabled Reference bandwidth unit is 100 mbps Area BACKBONE(0) Number of interfaces in this area is 4 (1 loopback) Area has no authentication SPF algorithm last executed 00:47:05.379 ago SPF algorithm executed 4 times Area ranges are Number of LSA 16. Checksum Sum 0x078460 Number of opaque link LSA 0. 
Checksum Sum 0x000000 Number of DCbitless LSA 0 Number of indication LSA 0 Number of DoNotAge LSA 0 Flood list length 0
  • 20. OSPF Neighborships • neighbor adjacencies • Check ospf neighbor detail (OSPF-MIB: ospfNbrState, ospfNbrEvents, ospfNbrLSRetransQLen) • How many state changes occur? • What is the current state? • Any retransmission happening? • Check the interface queue 20 # show ip ospf neighbor detail Neighbor 192.168.0.7, interface address 10.0.0.3 In the area 0 via interface GigabitEthernet0/1 Neighbor priority is 1, State is FULL, 6 state changes DR is 10.0.0.3 BDR is 10.0.0.4 Options is 0x12 in Hello (E-bit, L-bit) Options is 0x52 in DBD (E-bit, L-bit, O-bit) LLS Options is 0x1 (LR) Dead timer due in 00:00:39 Neighbor is up for 00:33:10 Index 2/2/2, retransmission queue length 0, number of retransmission 0 First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0) Last retransmission scan length is 0, maximum is 0 Last retransmission scan time is 0 msec, maximum is 0 msec
  • 21. BGP Monitoring Protocol (BMP) Overview Collecting Pre-Policy BGP Messages Adj-RIB-in (pre-inbound-filter) BGP Monitor Protocol update BMP collector BMP client Inbound filtering policing Loc-RIB (post-inbound-filter) iBGP update BMP message Adj-RIB-in (pre-inbound-filter) eBGP update BGP peer’s (external) BGP peer (internal) 21
  • 22. • IETF RFC 7854 • BMP client (router) provides pre-policy view of the ADJ-RIB-IN of a peer • Update messages from peer sent to BMP receiver • Example uses: • Realtime visualizer of BGP state • Traffic engineering analytics • BGP policy exploration BGP Monitoring Protocol 22
  • 23. OpenBMP Historical record of prefix withdraws Current route views and peer status 23 http://www.openbmp.org
  • 24. Data Plane 24 3Ws: when, where, and what
  • 25. User / Agent Checks • Treat network as a black box: are your beacon services working? • Synthetic service check (HTTP, DNS, etc.) • Ping (not all remotes will respond) • Data plane is exercised and tested • Variety = better coverage (multiple IP addresses / L4 ports per location) • Validate similar treatment (QoS) as real user traffic • Uptime and performance (loss, latency) metrics • Look for patterns, changes from normal. All down vs some down. • Capture and validate real user (human) incidents. What got missed? • Use wisely: network and server resources consumed A B R1 R2 R5 R6 R3 25
  • 26. Latency Network Jitter Dist. of Stats Connectivity Packet Loss FTP DNS DHCP TCPJitter ICMP UDPDLSW HTTP Network Performance Monitoring Service Level Agreement (SLA) Monitoring Network Assessment Multiprotocol Label Switching (MPLS) Monitoring VoIP MonitoringAvailability Trouble Shooting Operations Measurement Metrics Uses MIB Data Active Generated Traffic to Measure the Network Destination Source Responder LDP H.323 SIP RTP IP SLA IP SLA*(RFC 6812): Synthetic Traffic Measurements IP SLA IP SLA 26 *IP SLA can be replaced with other monitoring tools used by other vendors such as RPM of Juniper etc • IPSLA on router/switch – Shadow Router? • User end-system based agent software • Dedicated Agent
  • 27. traceroute • Understand the limitations • Sends 3 packets (default) at each TTL • Implementations • Linux/Cisco: UDP (ICMP and TCP-SYN are Linux optional) • UDP DST port # used to keep track of packets, increments per packet. Initial= 33434 (default) • SRC port #: randomized (linux), incrementing per packet (Cisco IOS) • Linux (GNU inetutils-traceroute) • UDP DST port# increments per TTL (not per packet) • SRC port is random but fixed per entire run • Windows: ICMP Echo request Widest dispersion against possibilities. Difficult to understand though. ICMP blocked frequently L Narrower dispersion. Story might be misleading. Internet: aka the TCP/80 network 27
  • 28. Unix traceroute • Multiple path options • Topology ‘shortcuts’ (same router seen at diff hop) • Ultimately all paths result in similar e2e delay 28 $ traceroute 62.2.88.172 traceroute to 62.2.88.172 (62.2.88.172), 30 hops max, 60 byte packets 1 152.22.242.65 (152.22.242.65) 1.044 ms 1.371 ms 1.585 ms 2 152.22.240.8 (152.22.240.8) 0.219 ms 0.328 ms 0.327 ms 3 128.109.70.9 (128.109.70.9) 1.066 ms 1.059 ms 1.168 ms 4 rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 1.634 ms 1.628 ms 1.736 ms 5 rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 5.354 ms 5.446 ms 5.557 ms 6 128.109.9.117 (128.109.9.117) 5.671 ms 128.109.9.170 (128.109.9.170) 7.141 ms 128.109.9.117 (128.109.9.117) 5.433 ms 7 wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net (128.109.1.105) 9.174 ms 128.109.1.209 (128.109.1.209) 8.256 ms 6.397 ms 8 dcp-brdr-03.inet.qwest.net (205.171.251.110) 18.414 ms chr-edge-03.inet.qwest.net (65.114.0.205) 27.353 ms 27.438 ms 9 dcp-brdr-03.inet.qwest.net (205.171.251.110) 21.739 ms 63-235-40-106.dia.static.qwest.net (63.235.40.106) 17.750 ms dcp-brdr-03.inet.qwest.net (205.171.251.110) 22.450 ms 10 63-235-40-106.dia.static.qwest.net (63.235.40.106) 22.531 ms 22.516 ms 84-116-130-173.aorta.net (84.116.130.173) 140.738 ms 11 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 140.831 ms 140.816 ms 84-116-130-173.aorta.net (84.116.130.173) 144.819 ms 12 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 144.074 ms 144.761 ms 84-116-130-58.aorta.net (84.116.130.58) 138.455 ms 13 84-116-130-58.aorta.net (84.116.130.58) 141.844 ms 141.924 ms 142.459 ms 14 84.116.204.234 (84.116.204.234) 145.603 ms 145.891 ms 145.987 ms 15 * * * 16 62-2-88-172.static.cablecom.ch (62.2.88.172) 268.281 ms 268.245 ms 268.176 ms 1 AAA 2 BBB 3 CCC 4 DDD 5 EEE 6 FGF 7 HII 8 JKK +10ms (unsustained) 9 JLJ 10 LLM +120ms (sustained) 11 NNM 12 NNO 13 PPP 14 QQQ 15 *** 16 RRR ~268ms (all three) filter + > 100 ms delay +120ms Atlantic crossing Reference
  • 29. Unix inetutils traceroute • Narrower view (no alternate paths directly seen) • Repeating nodes suggests multipath, or (unlikely) routing issue 29 $ inetutils-traceroute --resolve-hostname 62.2.88.172 traceroute to 62.2.88.172 (62.2.88.172), 64 hops max 1 152.22.242.65 (152.22.242.65) 0.783ms 0.727ms 0.798ms 2 152.22.240.8 (152.22.240.8) 0.226ms 0.228ms 0.221ms 3 128.109.70.9 (128.109.70.9) 0.967ms 0.980ms 0.962ms 4 128.109.70.137 (rtp7600-gw-to-dep7600-gw2.ncren.net) 1.576ms 1.598ms 1.567ms 5 128.109.9.17 (rlasr-gw-link1-to-rtp7600-gw.ncren.net) 5.149ms 5.140ms 5.126ms 6 128.109.9.166 (128.109.9.166) 7.113ms 7.098ms 7.306ms 7 128.109.1.209 (128.109.1.209) 7.835ms 8.326ms 7.958ms 8 65.114.0.205 (chr-edge-03.inet.qwest.net) 19.944ms 9.299ms 40.372ms 9 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 18.442ms 18.412ms 18.432ms 10 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 22.424ms 22.391ms 75.960ms 11 84.116.130.173 (84-116-130-173.aorta.net) 145.434ms 146.301ms 145.445ms 12 84.116.130.58 (84-116-130-58.aorta.net) 137.583ms 137.556ms 137.661ms 13 84.116.130.58 (84-116-130-58.aorta.net) 142.476ms 141.886ms 141.819ms 14 84.116.204.234 (84.116.204.234) 144.841ms 145.034ms 144.964ms 15 * * * 16 62.2.88.172 (62-2-88-172.static.cablecom.ch) 287.318ms 176.670ms 254.237ms Packets for hop 9,12 took a ‘shortcut’ and packets for hop 10,13 went long way Reference
  • 30. LFT • lft ‘layer 4 traceroute’ dynamically adjusts to responses • Firewall detection, whois and AS lookup integrated • Narrower packet changes, so narrower multi-path 30 $ sudo lft -ENA 62.2.88.172 Tracing ________________________________________________________________. TTL LFT trace to 62-2-88-172.static.cablecom.ch (62.2.88.172):80/tcp 1 [AS81] [NCREN-B22] 152.22.242.65 20.1/17.2ms 2 [AS81] [NCREN-B22] 152.22.240.8 20.1/20.1ms 3 [AS81] [CONCERT] 128.109.70.9 20.1/20.1ms 4 [AS81] [CONCERT] rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 20.1/20.1ms 5 [AS81] [CONCERT] rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 20.1/20.1ms 6 [AS81] [CONCERT] 128.109.9.117 20.1/20.1ms 7 [AS209] [unknown] chr-edge-03.inet.qwest.net (65.121.156.209) 20.1/19.5ms 8 [AS209] [QWEST-INET-35] dcp-brdr-03.inet.qwest.net (205.171.251.110) 20.1/18.4ms 9 [AS209] [QWEST-INET-17] 63-235-40-106.dia.static.qwest.net (63.235.40.106) 20.1/60.3ms 10 [AS6830] [84-RIPE/LGI-Infrastructure] 84-116-130-173.aorta.net (84.116.130.173) 160.7/160.7ms 11 [AS6830] [84-RIPE/LGI-Infrastructure] nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 160.7/160.7ms 12 [AS6830] [84-RIPE/LGI-Infrastructure] 84-116-130-58.aorta.net (84.116.130.58) 140.6/140.6ms ** [firewall] the next gateway may statefully inspect packets 13 [AS6830] [84-RIPE/LGI-Infrastructure] 84.116.204.234 160.7/160.6ms ** [neglected] no reply packets received from TTL 14 15 * [AS6830] [RIPE-C3/CC-HO841-NET] [target] 62-2-88-172.static.cablecom.ch (62.2.88.172):80 160.7ms Used tcp/80 SYN Reference
  • 31. MTR • Interactive combined traceroute and ping • Gives a sense of health of path (loss, delay Standard Deviation) • Narrow path view 31 Reference $ mtr 62.2.88.172 aakhter-nlr-ubuntu-01 (0.0.0.0) Sat May 30 18:57:09 2015 Keys: Help Display mode Restart statistics Order of fields quit Packets Pings Host Loss% Snt Last Avg Best Wrst StDev 1. 152.22.242.65 0.0% 145 0.8 0.9 0.7 10.0 0.8 2. 152.22.240.8 0.0% 145 0.3 0.2 0.2 0.3 0.0 3. 128.109.70.9 0.0% 145 1.0 3.3 1.0 182.3 17.2 4. rtp7600-gw-to-dep7600-gw2.ncren.net 1.0% 145 9.2 4.1 1.6 203.4 18.6 5. rlasr-gw-link1-to-rtp7600-gw.ncren.net 0.0% 145 5.3 5.3 5.1 6.8 0.2 6. 128.109.9.166 0.0% 145 7.1 7.3 7.1 16.1 0.8 7. wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net 0.0% 145 6.8 8.3 6.2 10.6 1.0 8. chr-edge-03.inet.qwest.net 0.0% 145 9.4 12.3 9.3 62.1 9.5 9. dcp-brdr-03.inet.qwest.net 0.0% 145 21.8 22.8 21.7 70.7 5.5 10. 63-235-40-106.dia.static.qwest.net 0.0% 145 21.8 24.5 21.7 86.1 10.6 11. 84-116-130-173.aorta.net 0.0% 145 144.8 145.0 144.7 152.9 1.0 12. nl-ams02a-rd1-te0-2-0-2.aorta.net 0.0% 145 144.1 145.5 144.0 165.4 3.7 13. 84-116-130-58.aorta.net 5.0% 144 142.9 142.3 142.0 145.6 0.4 14. 84.116.204.234 5.0% 144 145.1 145.1 144.9 145.3 0.0 15. 217-168-62-150.static.cablecom.ch 5.0% 144 145.9 146.1 145.2 164.3 1.9 16. 62-2-88-172.static.cablecom.ch 5.0% 144 313.0 260.3 152.6 508.0 80.0 Note variability, probably just the end system Just local noise, no carry over to later hops Sustained loss. Likely something wrong 12->13, or way back
  • 32. Check interface • Classic command • Check interface ‘up’ status • Stability: check log event or check routing table stability • Monitor in/out bit/packet changes # show interface GigabitEthernet1 is up, line protocol is up Hardware is CSR vNIC, address is 000c.291a.7f97 (bia 000c.291a.7f97) Internet address is 192.168.225.130/24 MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec, reliability 255/255, txload 1/255, rxload 1/255 Encapsulation ARPA, loopback not set Keepalive set (10 sec) Full Duplex, 1000Mbps, link type is auto, media type is RJ45 output flow-control is unsupported, input flow-control is unsupported ARP type: ARPA, ARP Timeout 04:00:00 Last input 00:05:35, output 00:09:58, output hang never Last clearing of "show interface" counters never Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0 Queueing strategy: fifo Output queue: 0/40 (size/max) 5 minute input rate 0 bits/sec, 0 packets/sec 5 minute output rate 0 bits/sec, 0 packets/sec 25349 packets input, 2381158 bytes, 0 no buffer Received 0 broadcasts (0 IP multicasts) 0 runts, 0 giants, 0 throttles 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored 0 watchdog, 0 multicast, 0 pause input 3958 packets output, 312408 bytes, 0 underruns 0 output errors, 0 collisions, 0 interface resets 56 unknown protocol drops 0 babbles, 0 late collision, 0 deferred 0 lost carrier, 0 no carrier, 0 pause output 0 output buffer failures, 0 output buffers swapped out
  • 33. Follow the Flow with NetFlow(RFC 3954) • Per-Node: Data plane observations and decisions captured • Src/dst mac/IP/port#s, DSCP values, in/out interfaces, etc. • Network view: flows centrally analyzed- NetFlow collector/analyzer • Biggest value: strategically placed partial views (eg WAN edge) 33 A B R1 R2 R5 R6 R4 R3 NetFlow Collector LiveAction
  • 34. • Developed and patented at Cisco Systems in 1996 • NetFlow is the de facto standard for acquiring IP operational data • Standardized in IETF via IPFIX (RFC 7011) • Provides network and security monitoring, network planning, traffic analysis, and IP accounting • Packet capture is like a wire tap • NetFlow is like a phone bill NetFlow(RFC 3954)—What Is It? Network World Article—NetFlow Adoption on the Rise http://www.networkworld.com/newsletters/nsm/2005/0314nsm1.html 34
  • 35. Src. IP Dest. IP Source Port Dest. Port Protocol TOS Input I/F … Pkts 3.3.3.3 2.2.2.2 23 22078 6 0 E0 … 1100 Traffic Analysis Cache Flow Monitor 1 Traffic Non-Key Fields Packets Bytes Timestamps Next Hop Address Source IP Dest. IP Input I/F Flag … Pkts 3.3.3.3 2.2.2.2 E0 0 … 11000 Security Analysis Cache Flow Monitor 2 Key Fields Packet 1 Source IP 3.3.3.3 Dest IP 2.2.2.2 Input Interface Ethernet 0 SYN Flag 0 Non-Key Fields Packets Timestamps Flexible NetFlow Multiple Monitors with Unique Key Fields Key Fields Packet 1 Source IP 3.3.3.3 Destination IP 2.2.2.2 Source Port 23 Destination Port 22078 Layer 3 Protocol TCP - 6 TOS Byte 0 Input Interface Ethernet 0 35
  • 36. • Flexible NetFlow Forwarding Status field captures forwarding (and drop reason) for flow. • Drop Count increments on any explicit drop by router NetFlow Forwarding Status & Drop Count Fields RFC 7270 36
  • 37. Network nodes are able to discover & validate RTP, TCP and IP-CBR traffic on hop by hop basis À la carte metric (loss, latency, jitter etc.) selections, applied on operator selected sets of traffic Allows for fault isolation and network span validation Per-application threshold and altering. Network Performance Monitor 37
  • 38. • RTP SSRC • RTP Jitter (min/max/mean) • Transport Counter (expected/loss) • Media Counter (bytes/packets/rate) • Media Event • Collection interval • TCP MSS • TCP round-trip time Performance Monitor Information Elements • CND - Client Network Delay (min/max/sum) • SND – Server Network Delay (min/max/sum) • ND – Network Delay (min/max/sum) • AD – Application Delay (min/max/sum) • Total Response Time (min/max/sum) • Total Transaction Time (min/max/sum) • Number of New Connections • Number of Late Responses • Number of Responses by Response Time (7- bucket histogram) • Number of Retransmissions • Number of Transactions • Client/Server Bytes • Client/Server Packets • L3 counter (bytes/packets) • Flow event • Flow direction • Client and server address • Source and destination address • Transport information • Input and output interfaces • L3 information (TTL, DSCP, TOS, etc.) • Application information (from deep packet inspection tool) • Monitoring class hierarchy Media Monitoring Application Response Time Other Metrics 38
  • 39. NetFlow QoS Analysis 39 Cisco Prime Infra LiveAction flow 5-tuple DPI/NBAR QoS processing DSCP How is my flow being classified? Did this QoS class drop traffic?
  • 40. Dedicated Protocol Analyzers • Wireshark and other protocol analyzers are great • Detailed analysis for variety of protocols at deep level • Dedicated probes are expensive to deploy pervasively • Operator has to make difficult judgment calls on where the problem is going to be– before it happens • Can be challenging after the fact- need on-site trained personnel. 40
  • 41. Embedded Packet Capture & Analyze • Capture packets locally to buffer on router • Store to flash, USB, FTP, TFTP for analysis in protocol analyzer • Capture does not add traffic to network LY-2851-8#monitor capture buffer pcap-buffer1 size 10000 max-size 1550 LY-2851-8#monitor capture point ip cef pcap-point1 g0/0 both LY-2851-8#monitor capture point associate pcap-point1 pcap-buffer1 LY-2851-8#monitor capture point start pcap-point1 LY-2851-8#monitor capture point stop pcap-point1 LY-2851-8#monitor capture buffer pcap-buffer1 export ftp://10.17.0.252/images/test.cap Gig0/0
  • 42. in-band OAM for IPv6 (iOAM6) • New IPv6 extension header defined on user packets • vs. IPv4 record-route option header • RFC 2460 does not define an option to record the route • Minimal performance hit (handled in data plane) • Packets continue on regular path • Instrumentation • Packet sequence numbers => detect packet loss • Time stamps => one way delay • Node and ingress/egress interface names => path recording • draft-brockners-inband-oam-requirements-03 42 Network Element Apps/Controller v6 traffic matrix Live flow tracing Delay distribution Bi-castíng control Loss matrix/ monitor App data monitoring Enhanced Telemetry Per hop and end-to-end data added to (selected) data traffic into the packet Node-ID Ingress i/f egress i/f Sequence# Timestamp App-Data
  • 43. iOAM6 Path Trace
    • Extended Ping
      H1#ping
      Protocol [ip]: ipv6
      Target IPv6 address: ::A:1:1:0:1D
      Repeat count [5]: 1
      Datagram size [100]: 300
      Timeout in seconds [2]:
      Extended commands? [no]: yes
      Source address or interface: gig0/1
      UDP protocol? [no]:
      Verbose? [no]: yes
      Precedence [0]:
      DSCP [0]:
      Include hop by hop Path Record option? [no]: yes
      Sweep range of sizes? [no]:
      Type escape sequence to abort.
      Sending 1, 300-byte ICMP Echos to ::A:1:1:0:1D, timeout is 2 seconds:
      (Gi0/1)R1(Gi0/2)----(Gi0/1)R4(Gi0/2)----(Gi0/2)R3(Gi0/3)----H3----(Gi0/3)R3(Gi0/2)----(Gi0/2)R4(Gi0/1)----(Gi0/2)R1(Gi0/1)
      Reply to request 0 (35 ms)
      Success rate is 100 percent (1/1), round-trip min/avg/max = 35/35/35 ms
    [Figure: topology H1, R1, R2, R3, R4, H3 (::A:1:1:0:1D). Routers are iOAM6 enabled; the V6 extension header is applied/decapped at the edge routers; the end system runs a standard ICMP stack]
  • 45. Be Prepared!
    • Be prepared and have data collection systems enabled
      • Enable passive monitoring on endpoints and network
      • Enable active tests
    • Helpdesk
      • Interview script => establish & maintain checklists
      • Multi-group access to tools, logs, etc.
    • Firefighters run drills, so should your teams!
      • Be familiar with the tools and how they respond on your network
    • Red phone: cross-domain teams (applications, UC, security, servers)
  • 46. Expanding your Toolbox and Knowledge
    • Commercial and open source tools to look at
      • Network topology & IP address management: netdot, GestióIP
      • Performance tests: iperf3, netperf
      • Service checks: Nagios Core, Zenoss Core
      • NetFlow / log analysis and monitoring: logstash, fluentd, Splunk
      • Template-driven config generation: ansible
    • Control Plane Troubleshooting (Troubleshooting IP Routing Protocols)
  • 47. Network Documentation Tool (netdot)
    • Open source
    • Started in 2002
    • Network interface discovery via SNMP
    • Discovery of L2 & L3 devices
    • Dynamically draws a diagram/topology of your network
    • Management of IPv4 and IPv6 addresses via IPAM
  • 49. GestióIP
    • Web-based IP management software
    • Supports concurrent users
    • Better search and filter capabilities than a traditional spreadsheet
    • Better statistical data
    • Less chance of human error compared to a spreadsheet
    • Tool-assisted migration from managed IPv4 to IPv6
  • 51. iperf3
    • Active measurement tool to discover available path capacity
      • worst link and worst host configurations
    • Test can be in either direction (only static NAT works)
    • TCP (retransmissions, rate, cwnd), SCTP and UDP (loss, jitter, out of order) tests
    [Figure: sender and receiver; control connection on TCP/5201; test traffic: TCP, SCTP, UDP]
  • 52. iperf3 examples
    • Run for 30 sec (-t 30)
    • Throw away stats from first 4 sec (-O 4)
    • Use -P for parallel streams
    Remote to local: ~940 Mbps
      $ bwctl -T iperf3 -t 30 -O 4 -s "56m-ps-4x10.sox.net:4823"
      bwctl: Using tool: iperf3
      bwctl: 40 seconds until test results available
      SENDER START
      Connecting to host 152.22.242.103, port 5160
      [ 15] local 143.215.194.123 port 45609 connected to 152.22.242.103 port 5160
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [ 15]   0.00-1.00   sec   107 MBytes   898 Mbits/sec    0   3.06 MBytes  (omitted)
      [ 15]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0   3.06 MBytes  (omitted)
      …
      [ 15]  29.00-30.00  sec   112 MBytes   944 Mbits/sec    0   3.06 MBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bandwidth       Retr
      [ 15]   0.00-30.00  sec  3.29 GBytes   942 Mbits/sec    0   sender
      [ 15]   0.00-30.00  sec  3.29 GBytes   943 Mbits/sec        receiver
      iperf Done.
      SENDER END
    Client to server (local to remote): ~19 Mbps, with retransmissions
      $ bwctl -T iperf3 -t 30 -O 4 -c "56m-ps-4x10.sox.net:4823"
      bwctl: Using tool: iperf3
      bwctl: 39 seconds until test results available
      SENDER START
      Connecting to host 143.215.194.123, port 5327
      [ 15] local 152.22.242.103 port 44855 connected to 143.215.194.123 port 5327
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [ 15]   0.00-1.00   sec  5.14 MBytes  43.1 Mbits/sec  411   25.5 KBytes  (omitted)
      [ 15]   1.00-2.00   sec  2.26 MBytes  19.0 Mbits/sec   15   19.8 KBytes  (omitted)
      …
      [ 15]  28.00-29.00  sec  2.26 MBytes  18.9 Mbits/sec   16   25.5 KBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bandwidth       Retr
      [ 15]   0.00-30.00  sec  59.8 MBytes  16.7 Mbits/sec  539   sender
      [ 15]   0.00-30.00  sec  60.7 MBytes  17.0 Mbits/sec        receiver
      iperf Done.
      SENDER END
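When such tests run on a schedule, the summary lines are what you typically want to keep and compare against a baseline. The Python sketch below pulls rate and retransmission counts out of the human-readable sender/receiver summary lines shown above. The regular expression and function name are assumptions written against this sample output; for anything long-lived, iperf3's JSON output (`-J`) is likely a sturdier input than scraping text.

```python
import re

# Matches summary lines such as:
#   [ 15] 0.00-30.00 sec 59.8 MBytes 16.7 Mbits/sec 539 sender
#   [ 15] 0.00-30.00 sec 60.7 MBytes 17.0 Mbits/sec receiver
SUMMARY_RE = re.compile(
    r"\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+\s+[KMGT]?Bytes\s+"
    r"(?P<rate>[\d.]+)\s+(?P<unit>[KMG]?)bits/sec\s+"
    r"(?:(?P<retr>\d+)\s+)?(?P<role>sender|receiver)\b"
)

# Scale factors to convert the reported unit to Mbit/s.
TO_MBPS = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}


def parse_iperf3_summary(text):
    """Return {'sender': {...}, 'receiver': {...}} from iperf3 text output."""
    results = {}
    for line in text.splitlines():
        match = SUMMARY_RE.search(line)
        if not match:
            continue
        retr = match.group("retr")
        results[match.group("role")] = {
            "mbps": float(match.group("rate")) * TO_MBPS[match.group("unit")],
            # The receiver line carries no retransmission column.
            "retransmits": int(retr) if retr is not None else None,
        }
    return results
```

A monitoring job could then flag a run when `mbps` falls well below the previous working state or when `retransmits` is far above it, as in the local-to-remote example.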
  • 53. netperf
    • Similar to iperf3 but:
      • Works bidirectionally in a NAT environment
      • Additional connections-per-second and transactions-per-second tests
      • Statistical confidence intervals (-I)
    Example (download test):
      > netperf -t TCP_STREAM -H 162.209.79.211 -i 30,10 -I 95,5 -j -l 60
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 162.209.79.211 () port 0 AF_INET : +/-2.500% @ 95% conf. : demo
      !!! WARNING
      !!! Desired confidence was not achieved within the specified iterations.
      !!! This implies that there was variability in the test environment that
      !!! must be investigated before going further.
      !!! Confidence intervals: Throughput      : 8.965%
      !!!                       Local CPU util  : 0.000%
      !!!                       Remote CPU util : 0.000%

      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec

       87380  16384  16384    60.52      13.91
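The warning in this output says the measured throughput varied too much between iterations to meet the requested interval (`-I 95,5`, i.e. 95% confidence within +/-2.5%). The idea can be illustrated with a short Python sketch. It uses a normal approximation from the standard library, which is an assumption made for simplicity and not netperf's exact calculation; the function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def relative_ci_width_pct(samples, confidence=0.95):
    """Full width of the confidence interval, as a percentage of the mean."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / sqrt(len(samples))
    return 2 * half_width / mean(samples) * 100


def confidence_achieved(samples, confidence=0.95, max_width_pct=5.0):
    """True when repeated runs agree closely enough to trust the average."""
    return relative_ci_width_pct(samples, confidence) <= max_width_pct
```

If `confidence_achieved` is false for a set of runs, the average throughput is not a reliable number yet: look for competing traffic or host load before drawing conclusions, which is what the netperf warning recommends.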
  • 54. Nagios Core
    • Monitoring and alerting engine
    • Open source, written in C
    • Flexible and scalable architecture
    • Event scheduler, event processor and alert manager
    • APIs can be used to extend its capabilities to perform additional tasks
    • Designed for Unix & Linux systems
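One common way to extend Nagios-style service checks is a small plugin that reports its result through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and a one-line status message. The Python sketch below shows only the threshold logic of such a check; the function name, label, and thresholds are assumptions for illustration, and how the value is measured is left out.

```python
# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_check(label, value, warn, crit):
    """Map a measured value (higher is worse) to an exit code and status line."""
    if value is None:
        return UNKNOWN, f"{label} UNKNOWN - no measurement available"
    if value >= crit:
        return CRITICAL, f"{label} CRITICAL - {value} (threshold {crit})"
    if value >= warn:
        return WARNING, f"{label} WARNING - {value} (threshold {warn})"
    return OK, f"{label} OK - {value}"


# Usage inside a plugin script (measurement code not shown):
#   import sys
#   code, message = evaluate_check("RTT_MS", measured_rtt_ms, warn=100, crit=250)
#   print(message)
#   sys.exit(code)
```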
  • 56. Zenoss Core
    • Network management system
    • Open source
    • Written 90% in Python, 10% in Java
    • GNU-based license
    • Web-based interface for monitoring
    • Used by the government sector, retail, financial institutions, service providers and the tech industry
  • 58. logstash
    • Open source tool for managing events and logs
    • Collects data/logs from many sources, then filters and displays them
    • Scalable data processing
    • Analysis, archiving, monitoring & alerting
    • Elasticsearch API is used for storage
    • Kibana (a browser-based analytics tool) is used to view Logstash data
  • 59. fluentd
    • Open source data collector
    • Written in C and Ruby
    • Requires very few system resources
    • Simple and flexible/extensible
    • 5000+ companies are using fluentd for data collection
    • Provides a unified logging layer
    • Decouples data sources from backend systems
  • 60. ansible
    • Automates software provisioning and configuration management
    • Used for application deployment/migration
    • Reduces complexity and repetition
    • Included with the Fedora distribution of Linux
    • Used for Linux, Unix and Windows
    • Written in Python and PowerShell
    • Can also be used in cloud environments such as AWS and Azure
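The "reduces complexity and repetition" point comes largely from template-driven config generation: one template, many devices or interfaces. The sketch below illustrates that idea with Python's standard-library `string.Template`. It is not Ansible and does not use Ansible's Jinja2 templating; the template text, field names, and addresses are all hypothetical.

```python
from string import Template

# Hypothetical per-interface template; $fields are filled from inventory data.
INTERFACE_TEMPLATE = Template(
    "interface $name\n"
    " description $description\n"
    " ip address $address $mask\n"
)


def render_interfaces(interfaces):
    """Render one config stanza per interface record, separated by '!' lines."""
    return "!\n".join(INTERFACE_TEMPLATE.substitute(record) for record in interfaces)


# Example inventory (illustrative values only).
INVENTORY = [
    {"name": "GigabitEthernet0/0", "description": "WAN uplink",
     "address": "192.0.2.1", "mask": "255.255.255.252"},
    {"name": "GigabitEthernet0/1", "description": "LAN",
     "address": "198.51.100.1", "mask": "255.255.255.0"},
]
```

Calling `render_interfaces(INVENTORY)` yields two consistent stanzas from one template; a missing field raises `KeyError` instead of silently producing a partial config.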
  • 61. Splunk
    • Accessible via standard browser or via mobile app
    • Collects and indexes data
    • Powerful statistical search of the data
    • Correlates and investigates events and activities
    • Displays reports in a customized dashboard
    • Turns searches into real-time alerts
    • Notifies via email or RSS
  • 63. Alerting & Collaboration
    • Routing of alerts / interesting events
      • Is this noise or signal?
      • Which team(s) to alert?
      • Who is on duty?
      • How to contact: SMS, IM, phone call…
      • PagerDuty, OpenDuty
    • Coordinating response
      • IM tools (Spark, HipChat etc.)
      • Email
      • Ticketing tools (OTRS, Jira, ServiceNow, Moogsoft…)
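The routing questions above (noise or signal, which team, who is on duty, how to contact) map naturally onto a small decision function. The Python sketch below is a toy illustration of that flow, not how PagerDuty or OpenDuty work internally; the severity scale, roster, names, and channels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    domain: str    # owning team, e.g. "network" or "applications"
    severity: int  # hypothetical scale: 0 (informational) to 5 (critical)
    summary: str


# Hypothetical on-duty roster: domain -> (person, preferred contact channel).
ON_DUTY = {
    "network": ("alice", "sms"),
    "applications": ("bob", "im"),
}
FALLBACK = ("duty-manager", "phone")  # used when no team owns the domain
MIN_SEVERITY = 3                      # below this, treat the event as noise


def route_alert(alert):
    """Return a notification dict, or None when the alert is considered noise."""
    if alert.severity < MIN_SEVERITY:
        return None
    person, channel = ON_DUTY.get(alert.domain, FALLBACK)
    return {"notify": person, "via": channel, "summary": alert.summary}
```

Keeping the roster and the noise threshold as data rather than code makes it easy to rehearse and adjust them during drills.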