Tracking down the specific cause of a network issue can be impractical or downright impossible. Find out how you can proactively increase network reliability using continuous verification of the operational intent of your network infrastructure in the second chapter of our popular webinar.
Transcript
Hello, and welcome to the IP Fabric webinar about how we can make our networks more reliable. Specifically, we are going to look at the concept of network assurance and operational intent verification. When you are looking at complex global networks, there are myriads of parameters and technologies that you have to manage. When an outage occurs, you need to understand not only what caused it, but also how to prevent it. In many cases, the road to understanding how to prevent a specific outage can take many days, weeks, or even months.
Sometimes the problem is never found, because replicating the underlying parameters of the issue that caused the outage can be beyond the abilities of network operators, specifically because gathering even basic information about network state involves a lot of manual activity at the size and scale of global network infrastructures. You might opt to replicate the issue by causing the outage again during a scheduled maintenance window. You can also try to understand the issue from recorded data, which is, of course, much less impactful, but it requires you to collect the data during the outage. That is usually not within the realm of possibility for network operators, because when an outage occurs, their goal is to recover the failed component as soon as possible. There is usually no time for root cause analysis.
During the outage, we don't want to understand the underlying cause of the overall service outage; we want to replace or fix the failed component as soon as possible. Sound network design should provide redundancy for critical components, so that when any of them fails, the network, or the service that the network provides, keeps functioning. Of course, an always-on service is a logical concept that is very hard to achieve in practice. So when an outage does occur, you need to focus not only on fixing the issue as soon as possible, but also on analyzing the network more deeply to understand what caused the issue in the first place. In our case, with the IP Fabric solution, network operators were able to take a snapshot during the outage and then focus on fixing the outage as soon as possible.
So thanks to the IP Fabric solution, we are able to perform root cause analysis afterwards, to find out whether and why the services were impacted and to determine the root cause. So let's look at how this works in practice. Here we have the IP Fabric platform web interface, where we can see that there are a number of snapshots in the system. Because we want to analyze a specific issue, we currently have our snapshot from the outage that happened in May loaded. We know that the network baseline, when extensive user acceptance tests and application tests were carried out, was in March, when the network functioned exactly as per the acceptance tests.
So we know that we have a snapshot of the network to compare against. What the IP Fabric platform does is collect network state and configuration data from every element in the network, providing you with a full history of network state. In addition, it allows you to simulate network behavior based on network algorithms in the underlying graph-based mathematical network model. Essentially, the platform creates a digital copy of your network infrastructure which can be used for analysis. This depth allows you to drill down to much deeper parameters than a network monitoring system would ever allow.
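To give a feel for what a graph-based network model means, here is a minimal sketch in Python using networkx. The device names and topology are invented for illustration; this is the general concept, not IP Fabric's actual implementation.

```python
# Minimal sketch of a graph-based network model; devices and topology are
# hypothetical, illustrating the concept rather than IP Fabric's internals.
import networkx as nx

# Nodes are network devices, edges are physical or logical links.
net = nx.Graph()
net.add_edges_from([
    ("host1", "sw1"), ("sw1", "rtr1"), ("sw1", "rtr2"),
    ("rtr1", "fw1"), ("rtr2", "fw1"), ("fw1", "server1"),
])

# With the topology in a graph, path questions become graph queries.
print(nx.shortest_path(net, "host1", "server1"))

# Simulating a failure is just removing an element and re-running the query.
net.remove_node("rtr1")
print(nx.has_path(net, "host1", "server1"))  # True: rtr2 still provides a path
```

Once the whole network lives in such a model, "what happens if this device fails" becomes a question you can answer offline, without touching production.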
For example, when looking at the inventory here, we see a list of devices, but the network information goes much deeper than that. We can go into the network diagrams and see specifically how sites are interconnected and which protocols are operating in the sites, and we can analyze each protocol or connection in much more depth. In this case, the operators provided us with information about which path, which specific service, had an issue, so we can load this view and look at this specific scenario. Here we have a source host which communicates with a server. And we can see that this service is actually not working, and we can see the reason why.
So over here, we see that this path would not be available because of an issue. And the issue is not with forwarding, but with zone matching, the firewall rules: one of the rules on a zone-based firewall denies this traffic. But because this behavior was also observed during the baseline, it is actually expected. So right away, we see that there is an inconsistency in the reporting from the network operators: when they ran the test, they weren't testing the correct behavior.
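As a rough illustration of how zone-based rule matching decides a path's fate, here is a hedged sketch. The zones, rules, and first-match-wins evaluation are assumptions made for the example, not the platform's actual rule engine.

```python
# Hedged sketch of zone-based firewall rule matching; zones, rules, and the
# first-match-wins evaluation order are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Rule:
    src_zone: str
    dst_zone: str
    dst_port: int | None  # None matches any port
    action: str           # "permit" or "deny"

RULES = [
    Rule("inside", "dmz", 80, "permit"),   # HTTP to the server zone is allowed
    Rule("inside", "dmz", None, "deny"),   # everything else inside->dmz denied
]

def evaluate(src_zone: str, dst_zone: str, dst_port: int) -> str:
    """Return the action of the first matching rule (default deny)."""
    for rule in RULES:
        if (rule.src_zone == src_zone and rule.dst_zone == dst_zone
                and rule.dst_port in (None, dst_port)):
            return rule.action
    return "deny"

print(evaluate("inside", "dmz", 443))  # deny: only port 80 is permitted
print(evaluate("inside", "dmz", 80))   # permit: matches the HTTP rule
```

In a rule set like this, a simulation run with the wrong protocol or port would report a denied path even though the application's real traffic is permitted, which is exactly the kind of inconsistency we just saw.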
After researching, you find out that the application actually runs over HTTP, so you can rerun the simulation. And we can see that this path actually works. When looking at this specific path, we can see that the path is available and the traffic actually passes. So now what we need to see is how this service behaved during the outage.
So we can switch to the snapshot that was taken during the outage, and we can see that the path ended right from the initial host. It passed through layer 2 and reached layer 3, but the path died at the first hop. We can click and see exactly the reason for the path failure. And in this case, the path failed due to a routing issue.
So we can look at the route for this specific destination in the cumulative routing table. This is an overall routing table containing all of the routes in the whole network. This is part of the network assurance concept: we can gain a much deeper understanding of our network infrastructure simply by having access to all of the data. In this case, I'm interested in all of the routes that can route to this specific address, which is why we perform the routing lookup with the prefixes that are not summarized.
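Conceptually, such a lookup scans every routing entry in the network for prefixes that contain the destination address. A simplified sketch using Python's ipaddress module; the table rows and device names below are invented sample data:

```python
# Simplified sketch of a network-wide routing lookup with Python's ipaddress
# module; the cumulative table below is invented sample data.
import ipaddress

# (device, prefix, next_hop) rows, as if merged from every device's RIB.
CUMULATIVE_TABLE = [
    ("rtr1", "10.20.0.0/16", "10.0.0.2"),
    ("rtr2", "10.20.30.0/24", "connected"),
    ("rtr3", "0.0.0.0/0", "10.0.0.1"),
]

def lookup(destination: str):
    """Return every route whose prefix contains the destination address."""
    dst = ipaddress.ip_address(destination)
    return [row for row in CUMULATIVE_TABLE
            if dst in ipaddress.ip_network(row[1])]

# Print candidate routes for the destination, most specific prefix first.
for device, prefix, next_hop in sorted(
        lookup("10.20.30.40"),
        key=lambda r: ipaddress.ip_network(r[1]).prefixlen,
        reverse=True):
    print(device, prefix, next_hop)
```

Filtering out the covering summaries then leaves only the specific prefixes, which is what the lookup in the demo returns.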
So we can see that there was actually only a single prefix, and it's connected. So the network actually existed and was available. We can compare the state of the routing tables during the baseline by simply switching the snapshot, and we can see that there are myriads of routes distributed all over the network for this specific network. So, going back to the outage analysis, we can look at the specific site where this route is located, and we can see that the route on the site was available, but at the same time we cannot really see why the outage would occur. So we can look at the state of this specific network during the baseline, and we can see that there are more devices, but we still don't understand why a failure would affect the availability of the devices, because everything seems to be redundantly connected.
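Comparing route state between two snapshots boils down to a set difference over the routing entries. A toy sketch, with invented route sets standing in for per-snapshot routing table exports:

```python
# Toy sketch of diffing route state between two snapshots; the route sets
# are invented, standing in for per-snapshot routing table exports.
baseline = {("rtr1", "10.20.30.0/24"), ("rtr2", "10.20.30.0/24"),
            ("rtr3", "10.20.30.0/24")}
outage   = {("rtr2", "10.20.30.0/24")}

print("routes missing during the outage:", baseline - outage)
print("routes new during the outage:", outage - baseline)
```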
We can look at single points of failure, or non-redundant links, and we can see that although the links are redundant, there are single points of failure in the network because of how they are connected. We can see that there are no connections from this firewall 10, and that this firewall cluster is not correctly connected at layer 3. If we concentrate just on layer 2 and layer 3, we can see that connections from the firewall are missing; and concentrating on layer 3 alone, we can see that there is no redundancy to the firewalls and no redundancy to the routers. So just like that, we can see that there are redundancy issues in our network. And the outage happened because the network is set up with sufficient device redundancy, but there aren't sufficient links interconnecting those devices.
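In graph terms, these non-redundant elements are articulation points (devices) and bridges (links), and they can be detected mechanically. A sketch on an invented topology that mirrors the situation in the demo, redundant routers but a non-redundant firewall:

```python
# Sketch: detecting single points of failure as articulation points (devices)
# and bridges (links) with networkx; the topology is an invented example.
import networkx as nx

net = nx.Graph()
net.add_edges_from([
    ("sw1", "rtr1"), ("sw1", "rtr2"),   # redundant routers...
    ("rtr1", "fw1"), ("rtr2", "fw1"),   # ...both feeding a single firewall
    ("fw1", "core1"),                   # single link to the core
])

# Devices whose failure would partition the network.
print("articulation points:", list(nx.articulation_points(net)))
# Links whose failure would partition the network.
print("bridges:", list(nx.bridges(net)))
```

Here the routers form a redundant pair, yet fw1 and its core link still show up as single points of failure, which is exactly the pattern of "redundant devices, insufficient links" described above.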
We can then create a rule for all of our routing tables. Specifically, in the data center, you want all of the OSPF routes to this specific IP address to be redundant. If we create a rule to find routes that have only a single next hop, we can immediately see all of the points that have to be fixed to make the network redundant, so that the next time a failure occurs, the service stays available. Setting up this continuous verification provides you, or your network, with assurance that such a failure will never happen again, as long as you eliminate this risk from your network.
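An intent verification rule of this kind boils down to a predicate evaluated over structured route data on every snapshot. A hedged sketch; the route records and field names are assumptions for the example, not the platform's schema:

```python
# Hedged sketch of an intent verification check: flag OSPF routes toward a
# prefix that have only one next hop. Field names and data are assumptions.
ROUTES = [
    {"device": "dc-rtr1", "prefix": "10.20.30.0/24", "protocol": "ospf",
     "next_hops": ["10.1.1.1", "10.1.1.5"]},
    {"device": "dc-rtr2", "prefix": "10.20.30.0/24", "protocol": "ospf",
     "next_hops": ["10.1.1.9"]},
]

def non_redundant_routes(routes, prefix):
    """Intent: every OSPF route to `prefix` must have at least two next hops."""
    return [r for r in routes
            if r["protocol"] == "ospf"
            and r["prefix"] == prefix
            and len(r["next_hops"]) < 2]

# Run against every new snapshot; a non-empty result is an intent violation.
for violation in non_redundant_routes(ROUTES, "10.20.30.0/24"):
    print("single next hop:", violation["device"], violation["next_hops"])
```

Running the same check against every new snapshot is what turns a one-off root cause analysis into continuous verification.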
To summarize: the IP Fabric platform collects data from network infrastructure devices on a continuous basis, in snapshots. It makes this information searchable, and it enables you to visualize the data and run network simulations. The data is presented in a structured format, which gives you the ability to set up network assurance intent verification checks. And because all of the snapshots remain available to you, the platform provides full network history.
That's it. Thanks for tuning in, and visit ipfabric.io for more information or to request a trial.
