BGP is famously known for being the routing protocol of the internet, but not only. A lot of companies are using BGP in their internal network. One of the main reasons for that: it allows great control and flexibility with routing exchange.
As with any routing protocol and critical connectivity, we want to ensure we are resilient to failure and that the resiliency is operational. With BGP, to do so, we need to have a closer look at the received prefixes.
Let's take a classic example, where you would have two devices connected to an external party. This could be your ISP, MPLS provider…
In this example, site L47, on the left, is connected to site L21 which is your MPLS cloud.
You have your two eBGP sessions established with the MPLS cloud, that's great! Now you want to ensure that you are resilient by checking the receive routes from the provider on both links. This is how you will be able to reach any of your sites connected to the MPLS cloud.
In this situation, if one link fails, you have full resiliency. You will not lose connectivity to your MPLS network and all the services hosted outside site L47.
In this situation, if you lose L47EXR1 or the link between this router and site L21, your secondary link will not be able to forward any traffic. L47EXR2 is not receiving any routes.
This is what you would see on your device:
The fact that you can see State/PfxRcd showing "0" tells you the BGP session is established, otherwise, you would see the current state (Idle, Active...). Any other number would be good, unfortunately in this scenario, you are not receiving any route.
You are now in a situation where you do not have any resiliency, and your monitoring system is not alerting us on the issue.
That’s a very good question! In a situation like this one, the BGP session is not affected, which means there is no error generated by the router. The device will not send Syslog messages or SNMP traps to inform your monitoring system that you are not receiving any routes.
That's where IP Fabric can play a crucial role to identify the issue quickly before it causes any damages. There are several ways to see the problem:
The IP Fabric dashboard provides an overview of the network analysis results, including issues and links to the detailed reports.
After seeing this table, you will want to check further details regarding the issue. Just click on the interesting number, and you will be redirected to the appropriate technology page and the intent verifications.
As we have seen at the beginning, you will be able to check directly on the diagram the number of received prefixes for a specific neighbor.
It's also possible to display on the diagram the verification information from all the supported technologies. In this example, we will add the BGP information on the diagram for site 47:
The moment we select the verification to add to the diagram, we can see L47EXR2 becomes red. If we click on the device, we will see the explanation on why it's showing red:
In addition, we can spot this issue by looking at the end-to-end path. Let's take an example where we are looking at the path from a source in site L47 to a destination in site L66, which is connected to the MPLS cloud.
In a normal situation, we would expect to see the traffic being able to use both links to reach the MPLS cloud:
If we now compare with the snapshot where we have the issue, we can clearly see that we have lost our resiliency:
You can't ensure BGP resiliency if you do not look at the received prefixes. The main issue with this situation is the fact that your monitoring system will not be able to inform you of the loss of received prefixes. The last thing you want is to face a massive outage because of one link failure and your secondary link was not operational.
I have been in that exact situation previously, and let's just say I did not enjoy writing the postmortem explaining why we did not know our redundant path wasn't working. It's not a situation you want to be in...
If you would like to find out more about IP Fabric and how it can help improve your existing infrastructure by detecting issues you are not aware of, please contact us through www.ipfabric.io! You can also follow our company’s LinkedIn or Blog, where more content will be emerging.
The fundamental purpose of a data network is to connect users with applications, data producers and consumers with storage. The most desirable feature of a network is to maximise availability of those services to the endpoints that consume them. In order to achieve that, we need to consider (amongst other things):
To be sure we can deliver availability then, we should design our networks with these tenets in mind.
Sometimes though, organic growth in your IT environment means that the original design principles can't be upheld. You need to change your network to fulfil a new requirement and fast! And so you build a change to your network that compromises the original design principles but provides a solution.
Often, this means connecting a new link to a new device which is not big enough, fast enough or resilient. Or some unseen configuration issue prevents redundant links from being available for use. Then, when an issue occurs later, you and your users suffer because the principles you laid out for maximising availability haven't been upheld and systems fall over. You are then left trying to trace the reason why things have failed.
Where do you start? Your diagrams and design documents were great when you built your network, but no one has kept them updated. And so, you have to start where they end, and:
then you sit down with your network team to analyse where your single points of failure are and what you can do about them. Don't forget that they might not be immediately obvious in the physical topology, but in the logical configuration!
Alternatively you could give the job to IP Fabric.
Each time IP Fabric runs a snapshot, it rediscovers the network topology from the physical layer upwards. Clicking through the Diagram | Site Diagrams menu shows a visualisation of the topology:
After turning off Layers 1 and 3 for clarity, then selecting the Options tab and clicking "Single Points of Failure", IP Fabric highlights them with red outlines:
IP Fabric is able to analyse and interpret relationships between devices both upstream and downstream. In this case, switches are shown as SPOFs because they are the sole upstream devices for some access switches.
As we've seen, IP Fabric's powerful visualisation helps us appreciate problems in topologies, but there are alternative and equally powerful views of the network. For example, you can check a simulated path between hosts for single points of failure. Select Diagrams | End-to-end path. After submitting source and destination IPs, non-redundant devices can be highlighted along the path.
Subtly different, you are also able to highlight non-redundant links, showing where there is a strict dependency on the link to ensure an end-to-end path between the hosts.
So you can see how IP Fabric has saved you huge amounts of time and effort spotting single points of failure in your network topology. And now you are in a position to proactively remediate before this becomes a problem!
If you have found this article helpful, please follow our company’s LinkedIn or Blog, where more content will be emerging. If you would like to test our solution to see for yourself how IP Fabric can help you manage your network more effectively, please contact us through www.ipfabric.io.