The Top 25 Network Problems and Their Business Impact (Part 2)
In our last post we covered the first 13 of the top network problems and how they can impact the applications our business needs to operate. In this issue we reveal problems 14 to 25 and how you can communicate them to the business people in your organization.
14. OSPF recalculations high
Routing protocol unstable; poor and inconsistent application performance. Link instability, link errors, or spanning tree instability can make an OSPF topology unstable. The routing protocol may then intermittently select non-optimum paths. If routes flap as a result, applications experience high jitter or loss of connectivity.
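One way to put a number on "OSPF recalculations high" is to count SPF runs inside a sliding time window. The sketch below assumes you have already collected SPF run timestamps (for example from router logs); the function name `spf_runs_exceeding`, the 10-minute window, and the threshold of 5 are hypothetical values chosen for illustration, not vendor guidance.

```python
from datetime import datetime, timedelta

def spf_runs_exceeding(timestamps, window=timedelta(minutes=10), threshold=5):
    """Return True if more than `threshold` SPF runs fall inside any `window`.
    Window and threshold are illustrative assumptions, not recommended values."""
    times = sorted(timestamps)
    start = 0
    for end, t in enumerate(times):
        # Advance the left edge until the window spans at most `window`.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > threshold:
            return True
    return False

base = datetime(2024, 1, 1)
stable = [base + timedelta(hours=i) for i in range(6)]           # one run per hour
flapping = [base + timedelta(seconds=30 * i) for i in range(8)]  # eight runs in 3.5 minutes
```

With this data, `stable` stays under the threshold while `flapping` trips it, which is the pattern you would expect from a topology with flapping links.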
15. Poor VoIP quality
Due to high jitter, delay, or packet loss; choppy voice calls; calls mysteriously disconnect. Poor VoIP quality can have many different root causes. By monitoring delay, jitter, and packet loss, you can narrow the set of possible problems to examine. By identifying the range of phones that are reporting poor statistics, you can better locate the potential source of the problem.
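To make "monitoring jitter" and "identifying the range of phones" concrete, here is a minimal sketch. The jitter function follows the interarrival jitter estimator described in RFC 3550 (smoothing gain of 1/16); the grouping function assumes each phone reports its jitter in milliseconds, and its name, the 30 ms threshold, and the /24 grouping are hypothetical choices for illustration.

```python
import ipaddress
from collections import Counter

def rtp_jitter(send_times, recv_times):
    """Interarrival jitter in the style of RFC 3550; all times in the same unit (e.g. ms)."""
    jitter = 0.0
    for i in range(1, len(send_times)):
        # Change in transit time between consecutive packets.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        # Exponential smoothing with gain 1/16.
        jitter += (abs(d) - jitter) / 16
    return jitter

def poor_phones_by_subnet(reports, jitter_ms=30.0, prefix=24):
    """reports: iterable of (phone_ip, jitter_ms). Counts poor-quality phones per subnet."""
    counts = Counter()
    for ip, jitter in reports:
        if jitter > jitter_ms:
            net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            counts[str(net)] += 1
    return counts
```

If most poor reports cluster in one subnet, the access switch or uplink serving that subnet is a good first place to look.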
16. Routing neighbor changes high
Access via this router is negatively affected by a high number of neighbor changes (BGP, OSPF, EIGRP). As with problems 13 and 14, something is causing the neighbor relationships to change repeatedly, which undermines the stability and reliability of the routing protocol. As a result, applications can experience high jitter or out-of-order packets. Finding and fixing the cause of the neighbor changes will result in a more stable and efficient network.
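A simple way to spot unstable adjacencies is to count how often each neighbor relationship goes down. This sketch assumes you have already parsed neighbor state changes from logs or traps into `(router, neighbor, new_state)` tuples; the function name `flapping_neighbors` and the threshold of 3 are hypothetical.

```python
from collections import Counter

def flapping_neighbors(events, threshold=3):
    """events: iterable of (router, neighbor, new_state) tuples.
    Returns {(router, neighbor): down_count} for pairs that went down more than `threshold` times."""
    downs = Counter((r, n) for r, n, state in events if state == "down")
    return {pair: count for pair, count in downs.items() if count > threshold}
```

Counting per neighbor pair, rather than per router, points you at the specific link or peer causing the churn.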
17. OSPF area not connected to backbone
The disconnected OSPF area will not be reachable from other OSPF areas, impacting applications that need to communicate between areas. OSPF inter-area routing relies on connectivity through the backbone area (area 0). When an area is disconnected from the backbone, communication within the area works, but communication between systems in that area and systems in other areas does not (the inter-area routes don't exist). Users and systems within the area will report what seems to be intermittent connectivity, depending on whether the destination is located within the area or in another area.
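The condition can be checked from an inventory of which areas each router participates in: a non-backbone area is attached to the backbone only if some router in that area also has an interface in area 0. The sketch below is a simplified model under that assumption; it ignores virtual links, and the function name and input format are hypothetical.

```python
def areas_without_backbone(router_areas):
    """router_areas: dict of router name -> set of OSPF area IDs it has interfaces in.
    Returns non-backbone areas with no area border router attached to area 0.
    Virtual links are not modeled in this sketch."""
    all_areas = set().union(*router_areas.values())
    connected = set()
    for areas in router_areas.values():
        if 0 in areas:
            # Every area this backbone router touches is attached to area 0.
            connected |= areas
    return all_areas - connected - {0}
```

For example, if `r3` is the only router in area 2 and it has no interface in area 0, area 2 is reported as disconnected.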
18. Unidirectional traffic flow
This is typically the result of misconfigured routing: application traffic uses non-optimum paths, increasing delay, potentially overloading other links, and affecting other applications. Asymmetric routing is sometimes desired; however, it increases network complexity and complicates troubleshooting. Servers are often configured with separate incoming and outgoing interfaces, which may cause unicast flooding, a condition in which frames are sent to all ports in a VLAN. The resulting high traffic levels impact the operation of all devices in the VLAN. In routed networks, a count of zero packets in one direction on a link over long time periods indicates a potential routing misconfiguration.
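The "zero packets in one direction for long time periods" test translates directly into a check over polled interface counters. This sketch assumes per-interval packet deltas have already been collected for each link; the function name, input format, and the 12-interval minimum are hypothetical.

```python
def unidirectional_links(samples, min_samples=12):
    """samples: dict of link name -> list of (packets_in, packets_out) deltas per polling interval.
    Flags links where exactly one direction carried zero packets over the last `min_samples` intervals."""
    flagged = []
    for link, deltas in samples.items():
        recent = deltas[-min_samples:]
        if len(recent) < min_samples:
            continue  # not enough history to judge this link
        ins = sum(d[0] for d in recent)
        outs = sum(d[1] for d in recent)
        # Idle links (zero both ways) and normal links (traffic both ways) are not flagged.
        if (ins == 0) != (outs == 0):
            flagged.append(link)
    return flagged
```

Excluding links that are idle in both directions keeps unused backup links from cluttering the report.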
19. Router interface down
Any router interface that is administratively up but operationally down is likely to be a redundant connection; if the other connection also fails, an outage results, affecting all applications that use it. Redundant networks hide first failures, so it is important to identify those failures before a second failure causes an outage. Best practice is to administratively shut down router interfaces that are not supposed to be active, so that any interface in the up/down state indicates something that has failed.
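If you poll interfaces via SNMP, the IF-MIB (RFC 2863) reports `ifAdminStatus` and `ifOperStatus` as integers where up is 1 and down is 2. The sketch below assumes those values have already been retrieved into tuples; the function name and tuple layout are hypothetical, and only operational status down(2) is treated as a failure here.

```python
# IF-MIB encodes both ifAdminStatus and ifOperStatus with up(1) and down(2).
IF_UP, IF_DOWN = 1, 2

def up_down_interfaces(interfaces):
    """interfaces: iterable of (name, ifAdminStatus, ifOperStatus) tuples.
    Returns interfaces that are administratively up but operationally down."""
    return [name for name, admin, oper in interfaces
            if admin == IF_UP and oper == IF_DOWN]
```

Interfaces that are administratively down are deliberately skipped, which is why the shutdown best practice above makes this report meaningful.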
20. Unstable root bridge
Bridge priority not set; applications quit working over unstable VLANs. An inexpensive switch that has the same bridge priority as the desired root bridge, but a lower MAC address, will try to become the root bridge of a spanning tree. In a busy VLAN, however, it may not have the backplane bandwidth or CPU to handle the task, and may not send BPDUs as frequently as it should (every 2 seconds by default). When several BPDUs are missed, the other switches elect another switch as the root. The resulting STP re-convergence affects application connectivity. The problem is difficult to troubleshoot because the network is working again by the time a network engineer looks at it, so application connectivity seems to be intermittent.
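The election rule behind this problem is simple: the bridge with the lowest bridge ID wins, where the ID compares priority first and MAC address second. Many switches default to a priority of 32768, so a tie on priority is decided by MAC address alone. The sketch below models only that comparison; the function name and switch names are hypothetical.

```python
def elect_root(bridges):
    """bridges: dict of switch name -> (priority, MAC address string).
    Returns the switch with the lowest (priority, MAC) bridge ID."""
    def bridge_id(name):
        priority, mac = bridges[name]
        return (priority, int(mac.replace(":", ""), 16))
    return min(bridges, key=bridge_id)

bridges = {
    "core": (32768, "00:1a:2b:3c:4d:5e"),   # intended root, default priority
    "cheap": (32768, "00:00:0c:11:22:33"),  # inexpensive switch, lower MAC
}
```

With equal default priorities, `cheap` wins on its lower MAC. Lowering `core`'s priority (for example to 4096) makes it win regardless of MAC, which is why explicitly setting bridge priority prevents this problem.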
21. Duplex mismatch
Increasing link errors; applications get slower as traffic volume increases. CRC errors, late collisions, and FCS errors are indicators of a duplex mismatch. A server is installed, ping works, and it is declared functional, but as traffic to it builds, errors increase. Finger-pointing among the network, server, and application teams often follows until the duplex mismatch is discovered. Conflicting vendor recommendations (Microsoft: fixed full duplex; Cisco: auto-negotiate) exacerbate the problem.
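The conflicting recommendations matter because of how auto-negotiation behaves when the other end is hard-coded: on 10/100 Ethernet, a port that auto-negotiates against a fixed-setting partner can still detect the speed, but it falls back to half duplex. The sketch below is a simplified model of that behavior for 10/100 links only; the function names are hypothetical, and it assumes both ends support full duplex when both auto-negotiate.

```python
def effective_duplex(side_a, side_b):
    """Each side is 'auto', 'full', or 'half'. Returns the duplex each end actually runs.
    Simplified 10/100 model: auto vs. auto agrees on full; auto vs. fixed falls back to half."""
    def resolve(mine, theirs):
        if mine != "auto":
            return mine  # a hard-coded port keeps its setting
        return "full" if theirs == "auto" else "half"
    return resolve(side_a, side_b), resolve(side_b, side_a)

def is_mismatch(side_a, side_b):
    a, b = effective_duplex(side_a, side_b)
    return a != b
```

A server hard-coded to full duplex facing a switch port left on auto-negotiate is exactly the mismatched case: the server runs full, the switch port runs half, and errors grow with traffic.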
22. Downstream hub or switch
Unauthorized devices added to the network; compromised network integrity and security; see 20. Wireless routers, switches, hubs, and other network devices should be under common administration in order to provide the best network security. An added switch could have a lower bridge priority, making it the root bridge of a VLAN and causing stability problems (see 20). Rogue DHCP servers in wireless routers can cause intermittent connectivity problems within a subnet unless the network is specifically configured to protect against them.
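Rogue DHCP servers can be spotted by comparing the servers that answer clients against the list you expect. This sketch assumes DHCPOFFER packets have already been captured and reduced to `(server_ip, client_mac)` pairs; the function name and input format are hypothetical.

```python
def rogue_dhcp_servers(offers, authorized):
    """offers: iterable of (server_ip, client_mac) from observed DHCPOFFER packets.
    authorized: set of approved DHCP server IPs. Returns unexpected servers, sorted."""
    return sorted({server for server, _ in offers if server not in authorized})
```

A consumer wireless router plugged into the LAN typically shows up here with an address from its own default range.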
23. Port in ErrDisable state
The end stations connected via this port are disconnected from the network until the port is re-enabled (either automatically or manually). A variety of configuration options allow switch ports to be disabled when certain conditions occur, such as receiving BPDUs or DHCP responses (see 20, 22). Some vendors will disable a port if it experiences too many errors. Automatically identifying these ports can head off a trouble call from a user or server administrator who is having connectivity problems because a port has been disabled.
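Once port status has been collected from each switch, finding disabled ports is a simple filter. This sketch assumes a poller has already produced a status string per port and that disabled ports are reported as `"err-disabled"`; the function name, input layout, and status string are assumptions for illustration.

```python
def errdisabled_ports(port_status):
    """port_status: dict of switch name -> dict of port -> status string.
    Returns (switch, port) pairs whose status is 'err-disabled', in sorted order."""
    return [(switch, port)
            for switch, ports in sorted(port_status.items())
            for port, status in sorted(ports.items())
            if status == "err-disabled"]
```

Running a check like this on a schedule means the network team learns about a disabled port before the affected user calls.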
24. Unbalanced & unused ether-channels
Increased latency & jitter affecting sensitive applications like VoIP; compromised redundancy. Packet distribution across an ether-channel may be unbalanced if a non-optimum packet distribution algorithm is selected. Changing the algorithm can make the distribution more balanced and increase overall throughput. An unbalanced ether-channel congests more easily, resulting in application performance that is lower than expected.
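Ether-channels typically pick a member link by hashing selected header fields, so the choice of fields determines how evenly traffic spreads. The sketch below illustrates the effect using CRC32 modulo the member count as a stand-in hash; real switches use their own algorithms, and the function name and flow format are hypothetical. The example traffic all comes from one source MAC (for instance a router) to many destinations.

```python
import zlib

def member_load(flows, members, key):
    """flows: list of dicts with 'src_mac', 'src_ip', 'dst_ip', 'packets'.
    key: function selecting the header fields that feed the hash.
    CRC32 modulo member count is a stand-in, not any vendor's actual algorithm."""
    load = {m: 0 for m in range(members)}
    for f in flows:
        bucket = zlib.crc32(key(f).encode()) % members
        load[bucket] += f["packets"]
    return load

flows = [{"src_mac": "00:00:5e:00:01:01", "src_ip": "10.0.0.1",
          "dst_ip": f"10.1.0.{i}", "packets": 100} for i in range(1, 41)]

by_src_mac = member_load(flows, 4, lambda f: f["src_mac"])
by_ip_pair = member_load(flows, 4, lambda f: f["src_ip"] + "-" + f["dst_ip"])
```

Hashing on source MAC alone sends every packet down a single member, leaving the other three unused; hashing on the source and destination IP pair spreads the same traffic across several members.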
25. HSRP or VRRP peer not found
Redundancy configured but not operating correctly; outage when a second failure occurs. A connectivity or application outage may not yet have occurred, because one device in the redundant pair is still running, but its peer has not been detected. The cause may be a broken link between the devices, a redundant device that has not yet been installed, or a redundant device (or its interface) that has failed. When a second failure occurs in the redundant configuration, a network outage follows, impacting applications. Knowing that a redundant configuration is not operational allows it to be corrected before important applications are affected. Identifying and correcting these problems will allow your network to better serve your business's requirements.
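A missing peer shows up as a redundancy group that has an active router but no standby. This sketch assumes group state has already been polled from each router into `(group_id, router, state)` tuples using HSRP's active/standby or VRRP's master/backup terms; the function name and input format are hypothetical.

```python
ACTIVE_STATES = {"active", "master"}    # HSRP active, VRRP master
BACKUP_STATES = {"standby", "backup"}   # HSRP standby, VRRP backup

def groups_missing_peer(observations):
    """observations: iterable of (group_id, router, state) tuples.
    Returns groups that have an active/master router but no standby/backup."""
    states_by_group = {}
    for group, _router, state in observations:
        states_by_group.setdefault(group, set()).add(state.lower())
    return sorted(group for group, states in states_by_group.items()
                  if (states & ACTIVE_STATES) and not (states & BACKUP_STATES))
```

Such a group is still forwarding traffic today, which is exactly why it goes unnoticed until the surviving device fails.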
We want to thank NetCordia for sharing these problems with us, along with how NetMRI has helped them discover these problems and reduce network outages.