
Every network is growing, and every network is becoming more complex as new requirements arrive from technology evolution or customer needs, e.g. building more reliable network connections, application-based routing, etc. All of these are triggers for the changes we face almost every day. Of course, there are small (not so important) changes, but there are also complex network changes including core device replacement, design changes impacting the business, and emergency changes caused by a failure.

I have managed small networks of a handful of devices, larger networks consisting of tens or hundreds of network devices, and also enterprise networks with thousands of routers, switches, firewalls and many other network appliances. I want to share some of the best practices I have learned from all the changes I have prepared and executed.

Network change process

There should be a reason for every network change, but let's exclude the emergency changes, as those are mostly about how to fix the network so that the business can go on ASAP. I want to focus on planned changes, which we can prepare for properly. The best way is to have a change management process in place, including a CAB (Change Approval Board) of technical reviewers who can check the whole change preparation: the description, the commands and the devices to be implemented or installed, and also the testing sequence needed to prove that the change is successful.

The ideal way is to follow these points to ensure we know what to do before we touch the network.

In the case of a complex change it is better to divide the whole activity into smaller pieces and prepare tests for each of these sub-tasks. This helps avoid complex troubleshooting during the maintenance window and having to review the commands applied from the very beginning of the change.

Small change

Let's imagine we need to implement a small change caused by a change of the ISP's PtP network on our MPLS connection. The only commands we need are to modify the IP address on the WAN interface and to modify the next-hop of our static route. So what should we check? What kind of testing method should we choose?

I recommend creating a set of verification commands that is not based on a specific change, just to be able to check the health of the device and take a snapshot of the various tables. This gives us a chance to prove that an issue existed right before we started the change, e.g. no connection to the corporate FTP server, external connectivity not working, or the log showing an issue unrelated to our change. Based on the device type, we should gather the uptime, the last reload reason, the device logs, NTP status, routing table, mac-address table, interface stats, ARP table, multicast table(s), etc. Then we verify the subject of the change itself. Using ping and traceroute is the most common way to check routing: ping the ISP IP, ping any other off-site resource, ping some Internet IP (personally, I use the public DNS IPs, e.g. 8.8.8.8 or 4.2.2.2) and also trace to these IPs. Finally, check the commands specific to this change: the interface configuration and the default route information.

And the most important fact is the reason why we are configuring the network services. It's not just for fun, just to have a router/switch/firewall with an Internet connection; mostly it serves the customer with a set of services they need for the business (except in a lab environment). Therefore the most critical testing is the application testing procedure from the customer side, and it should be executed directly by the customer or an onsite user to confirm everything works as expected. For a basic pre-check, see below:

!
show clock
show version
show run
!
show ip int brie
show int description
show int e0/0
show int e0/0.3733
show ip route
show ip route 0.0.0.0
sh ip proto sum
sh ip ospf
sh ip ospf neighbors
sh ip ospf data
sh ip nat translation
sh ntp status
show log
!
!!! ping the ISP interface (old and new)
!
ping 10.73.254.1 rep 100 size 1024
ping 10.73.200.1 rep 100 size 1024
!
!!! ping the off-site resource
!
ping 10.0.20.20
!
!!! traceroute to the off-site resource
!
traceroute 10.0.20.20 prob 1 time 1 num
!
!!! ping the Internet IP
!
ping 8.8.8.8
!

The change activity should start by running the verification commands, then changing the IP and the default route (in this case) and running the same verification command set again. I ask the customer to run their own verification after I'm done with my portion, to avoid a high traffic load during my checks. Then I use a comparison tool to review the PRE and POST states:
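Any text comparison tool will do for the PRE/POST review; a minimal sketch using standard diff is below (the log file names are hypothetical and stand for the captured PRE and POST outputs):

```shell
#!/bin/bash
# Show only the lines that differ between the PRE and POST verification logs.
# pre_check.log / post_check.log are hypothetical names for the captured outputs.
diff -u pre_check.log post_check.log \
  | grep -E '^[+-]' \
  | grep -vE '^(\+\+\+|---)' \
  || echo "No differences found"
```

If the only differences are the expected ones (the new WAN IP and next-hop), the change can be considered clean from the network side.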


From the network perspective the change has been implemented successfully: all the checks are as expected and the routing is working fine. Now we should ask the customer to do their tests, i.e. their own pings, traceroutes and application testing, e.g. working with emails, VoIP services, core apps like SAP, web services, FTP services, etc.

Complex change

Let's review all the steps required to proceed with this simple change of one IP address: analyze the risk and impact of the change, prepare the verification procedure, prepare the commands needed to implement the change, run all the checks, apply all the commands, run all the checks once again and compare the results. And now, let's add some complexity to this trivial network change.

What if the replacement of a core network device is needed? That means more preparation work and also more attention to the whole process. What if the site-local dynamic routing protocol is planned to be replaced, e.g. replacing old RIP with newer OSPF? It is no longer one step, one or two commands and one or two ping/traceroute verifications. I will go through the previously described process more deeply and add some details to each step.

1. Prepare all the commands

We should be prepared as much as possible and should have ready all the commands we will apply on the various devices, to save time during the change window. Many complex changes have to be planned very carefully, and it is better to split the whole change into several stages, each with its specific command set to be applied.



These "big actions" are mostly coordinated with the customer, therefore it is good to define milestones within the implementation plan and discuss the progress with the whole team. I will cover the rollback plan later in this chapter.


It is still very important to have all the information we might need in one place: IP addressing of all devices, access credentials incl. local user access, topology diagrams, etc. It is also important to double-check the order of the commands we will apply line by line; we must be very careful not to apply any command that would cut the device off. Personally, I have caused this many times in the past: I applied the wrong access-list, I didn't allow the correct VLAN on the uplink trunk, I configured the wrong IP address on the WAN interface, and many other things... So, if possible, it is good to use a scheduled rollback or reload of the device to revert to the last known configuration (e.g. Cisco's reload in command).
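On Cisco IOS this safety net can be sketched as follows (the 15-minute timer is illustrative, and it assumes the change has not yet been saved to startup-config):

```
! Schedule an automatic reload as a safety net before the risky commands
reload in 15
!
! ... apply the change and run the verifications ...
!
! Everything verified, so cancel the pending reload and save the config
reload cancel
copy running-config startup-config
```

If we cut ourselves off during the change, the device simply reboots with the last saved configuration and we regain access.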

2. Prepare all the verifications

This is definitely the most critical part of the change process, not from the functional point of view but from the change management perspective. It must be proven that the network is working at 100%: all types of communication have to be tested and all parameters verified.

The big checks (PRE/POST and after each specific stage) should include the global verification (availability, path parameters) and should be automated (covered later in the text), while the small verification boxes represent just "local" testing after the specific commands are executed (e.g. verify the routing protocol is up and running, the ACL matches the traffic, the QoS marking is working, etc.).

Most of the network devices we configure have two sides: the firewall uses inside and outside interfaces, the router uses WAN and LAN interfaces, the switch connects hosts or other switches towards the core, etc. Based on this, I recommend using three different types of testing methods:

a) checking connectivity TO THE DEVICE

It means we should check the availability of the configured device from the core or from the downstream direction. We can check the online/offline state with the ping tool and the incoming path with the traceroute command. The IP address can be the management loopback, the WAN IP address or any other IP important for communication with this device.

b) checking connectivity FROM THE DEVICE

It is very important to have a list of corporate resources that should be available throughout the whole network (FTP server, NTP server, VPN hub, ...), and the best way to check this part is to ping all these resources (IPs) and trace the path to them to ensure everything works as expected.

c) checking connectivity THROUGH THE DEVICE

It may happen that the network is not working even though the previous two checks are OK. Why? Because the traffic may not pass through the device correctly: there could be a problem with the advertisement of the local prefixes, an issue with redistribution, an L2 switch VLAN ACL, or mismatched filtering within the FW ACL. So pinging the core or corporate resources from the LAN (or inside) AND pinging the LAN/inside IPs from the core is a must in this case, to ensure both directions are OK.

3. Prepare the rollback plan

As mentioned before, milestones or check points should be defined to have the possibility to decide whether or not to continue with the network change (the go/no-go decision). They usually come after the verification at every stage:



The rollback plan should contain the steps for all the devices touched during the change window: re-cabling instructions and the command set for each milestone, each with its own verification procedure. The global post-check verification is executed after the rollback is completed, and a comparison with the pre-check output is mandatory.

Verification tasks automation

The amount of verification grows with the complexity of the change, so we should try to save effort by using automation tools. I will omit the commercial tools for monitoring network resources and describe some of the possibilities I've encountered during my work.

I will use just three verification commands that are implemented on every OS I have met: ping to check whether the destination IP is reachable, traceroute to check the exact path between the source and the destination IP, and telnet to check various L4 ports. Using the same differentiation of the verification directions as described above, I will talk about only two directions from that list: verification executed on the device that is being changed, and verification from any other location towards this device.
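As a side note, when no telnet client is at hand on a Linux jump host, bash itself can perform the L4 port check through its built-in /dev/tcp pseudo-device (the IP and port below are illustrative, reusing the FTP server from the examples above):

```shell
#!/bin/bash
# TCP port check without telnet: try to open a connection via bash's /dev/tcp.
# 10.0.20.20:21 is an illustrative target (the corporate FTP server).
host=10.0.20.20
port=21
if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
  echo "Port $port on $host is open"
else
  echo "Port $port on $host is closed or filtered"
fi
```

The timeout wrapper keeps the check from hanging when the port is silently filtered instead of actively refused.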

1. Verification from the impacted device

It's easier to execute the verification procedure from the impacted device, as we stay at one prompt and the only challenge is to process the commands one by one.


On Cisco devices, the easiest way is to use a Tcl script to run the verification set built from the native commands mentioned above. It can be stored as a procedure and called at any time during the change window, or it can be processed just once as needed.

Let’s demonstrate how it works and the difference between the various possibilities. Let’s have three IP addresses we will use for our verification from one device — 10.0.20.1, 10.0.20.20 and 10.0.20.21. I have prepared the set of verification commands:

!
ping 10.0.20.1 rep 2
ping 10.0.20.20 rep 2
ping 10.0.20.21 rep 2
!
traceroute 10.0.20.1 prob 1 time 1 num
traceroute 10.0.20.20 prob 1 time 1 num
traceroute 10.0.20.21 prob 1 time 1 num
!

Of course, I can use the standard copy&paste function, but the problem is processing the input via the TTY session. The next exhibit shows the result when using copy&paste: one ping command is missing (even though it was within the pasted text) and two traceroute commands are missing as well.

L73EXR1#ping 10.0.20.1 rep 2
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 10.0.20.1, timeout is 2 seconds:
!!
Success rate is 100 percent (2/2), round-trip min/avg/max = 4/4/5 ms
L73EXR1#ping 10.0.20.20 rep 2
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 10.0.20.20, timeout is 2 seconds:
!!
Success rate is 100 percent (2/2), round-trip min/avg/max = 5/11/18 ms
L73EXR1#traceroute 10.0.20.21 prob 1 time 1 num
Type escape sequence to abort.
Tracing the route to 10.0.20.21
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 1 msec
3 10.31.254.2 [AS 21] 3 msec
4 *
5 10.31.47.115 [AS 31] 2 msec
6 10.31.49.117 [AS 31] 1 msec
7 10.0.11.254 [AS 31] 4 msec
8 10.0.20.21 [AS 31] 5 msec

We need to ensure that every command is pasted/processed only once the prompt is ready. With a Cisco Tcl script we can modify the verification set as follows:

foreach address {
    10.0.20.1
    10.0.20.20
    10.0.20.21
} {
    ping $address rep 3
}

This means we can enter Tcl mode by typing the tclsh command and pasting the text above. It will process the commands one by one as needed. Then we can expand the Tcl script and add traceroute or any other Cisco command that takes the IP address as an argument. To make it simpler, it's possible to create a procedure with any name, put all the verification into the braces, and then call it at any time from the tclsh prompt. The example for our verification set:

proc CheckFROMonce {args} {
    foreach address {
        10.0.20.1
        10.0.20.20
        10.0.20.21
    } {
        ping $address rep 3
    }
    foreach address {
        10.0.20.1
        10.0.20.20
        10.0.20.21
    } {
        trace $address prob 1 time 1 num
    }
}

Then, let’s call the procedure once:

L73EXR1(tcl)#CheckFROMonce
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 10.0.20.1, timeout is 2 seconds:
!!!
Success rate is 100 percent (3/3), round-trip min/avg/max = 4/6/8 ms
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 10.0.20.20, timeout is 2 seconds:
!!!
Success rate is 100 percent (3/3), round-trip min/avg/max = 2/8/14 ms
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 10.0.20.21, timeout is 2 seconds:
!!!
Success rate is 100 percent (3/3), round-trip min/avg/max = 2/5/10 ms
Type escape sequence to abort.
Tracing the route to 10.0.20.1
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 3 msec
3 10.31.254.2 [AS 21] 3 msec
4 *
5 10.31.47.115 [AS 31] 7 msec
6 10.31.49.117 [AS 31] 5 msec
7 10.0.11.254 [AS 31] 5 msec
8 10.0.20.1 [AS 31] 4 msec
Type escape sequence to abort.
Tracing the route to 10.0.20.20
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 1 msec
3 10.31.254.2 [AS 21] 7 msec
4 *
5 10.31.47.115 [AS 31] 7 msec
6 10.31.49.117 [AS 31] 6 msec
7 10.0.11.254 [AS 31] 9 msec
8 10.0.20.20 [AS 31] 3 msec
Type escape sequence to abort.
Tracing the route to 10.0.20.21
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 2 msec
3 10.31.254.2 [AS 21] 5 msec
4 *
5 10.31.47.115 [AS 31] 2 msec
6 10.31.49.117 [AS 31] 2 msec
7 10.0.11.254 [AS 31] 3 msec
8 10.0.20.21 [AS 31] 3 msec
L73EXR1(tcl)#

In case we cannot use Tcl, or we don't want to for any reason, we can use our own script based on the tool or scripting language we prefer: it can be a Perl or Python script with a direct SSH/Telnet module, or it can be served by a VBScript within the SecureCRT client. The simplest VBScript, without loading external files (e.g. a device list or command set), is as follows:

#$language = "VBScript"
#$interface = "1.0"

Sub Main
    ' Log the whole verification run to one file
    crt.Session.Log False
    crt.Session.LogFileName = "logs\CheckFROMonce.log"
    crt.Session.Log True
    crt.Screen.Synchronous = True
    ' Send each command and wait for the prompt before sending the next one
    crt.Screen.Send "ping 10.0.20.1 rep 2" & vbcr
    crt.Screen.WaitForString "# "
    crt.Screen.Send "ping 10.0.20.20 rep 2" & vbcr
    crt.Screen.WaitForString "# "
    crt.Screen.Send "ping 10.0.20.21 rep 2" & vbcr
    crt.Screen.WaitForString "# "
    crt.Screen.Send "trace 10.0.20.1 prob 1 time 1 num" & vbcr
    crt.Screen.WaitForString "# "
    crt.Screen.Send "trace 10.0.20.20 prob 1 time 1 num" & vbcr
    crt.Screen.WaitForString "# "
    crt.Screen.Send "trace 10.0.20.21 prob 1 time 1 num" & vbcr
    crt.Screen.WaitForString "# "
    ' Stop logging and confirm the run is finished
    If crt.Session.Logging Then
        crt.Session.Log False
    End If
    crt.Sleep 500
    MsgBox "DONE: Generic Scripting"
End Sub

There are many options for how to proceed with this type of task, but the aim is the same for all methods: save time by pasting the commands, avoid human error by using one verification procedure in all the verification steps, and have identically formatted logs so the results can be compared at any time during or after the change.

2. Verification to or through the impacted device

The second part of the verification is a slightly different scenario: testing the function of the impacted device for traffic that is destined to the device or passing through it.

To fulfill this task we should log in to various devices and execute a specific set of commands. The command set can be the same, checking the same connectivity from various network sources, or it can additionally test different services or destinations.

I recommend including these verifications:

These verification steps can be automated with a scripting language, by preparing a specific device list as an input file and executing the verification command set from these devices step by step.
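A minimal sketch of such a loop in BASH (the file name devices.txt is hypothetical; real automation would SSH to each device and run the verification set there, which I leave out here):

```shell
#!/bin/bash
# Read a device list (one IP per line) and report reachability for each entry.
# devices.txt is a hypothetical input file listing the verification sources.
while read -r host; do
  [ -z "$host" ] && continue   # skip empty lines
  if ping -c 2 -W 1 "$host" >/dev/null 2>&1; then
    echo "Host $host is up"
  else
    echo "Host $host is DOWN"
  fi
done < devices.txt
```

The same skeleton can be extended to run traceroute or the L4 port checks per device, keeping one uniformly formatted log for the PRE/POST comparison.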

A simple BASH script, located and executed somewhere in the network far from the change location, can be used to monitor the local resources, incl. the management IP, local prefixes, etc.

#!/bin/bash
# IPs to watch: the management IP and a local prefix from our example
HOSTS="10.0.20.73 10.73.255.1"
# number of echo requests per cycle
C=2
while true
do
    for conHost in $HOSTS
    do
        # count the received replies; default to 0 if ping produced no summary line
        result=$(ping -c "$C" "$conHost" | grep 'received' | awk -F',' '{ print $2 }' | awk '{ print $1 }')
        if [ "${result:-0}" -eq 0 ]; then
            echo "Host $conHost is down at $(date)"
        fi
    done
done

We can specify as many hosts as needed and check their connection state: if the ping result is OK (the IP is reachable), the script shows no output; if there is a reachability issue, it displays the information that the host is down, with the date/time detail:

trehor@labremote:~$ ./monitor.sh  
Host 10.73.255.1 is down at Thu Jul 26 06:39:41 CEST 2018
Host 10.73.255.1 is down at Thu Jul 26 06:39:53 CEST 2018
Host 10.73.255.1 is down at Thu Jul 26 06:40:05 CEST 2018
^C
trehor@labremote:~$

Because of the infinite loop within the BASH script, you must terminate the monitoring script by pressing CTRL+C.
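If the monitor should stop on its own at the end of the maintenance window, wrapping it with the coreutils timeout command is an option (the script name and the 30-minute window are illustrative):

```shell
# Run the monitoring script for 30 minutes, then terminate it automatically.
# timeout exits with status 124 when it had to kill the command.
timeout 30m ./monitor.sh
```

This way the monitoring log ends exactly with the change window, which also makes the PRE/POST comparison cleaner.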

In my opinion, verification is a very important part of network changes, regardless of whether it's a simple change (changing the next-hop of a static route) or a very complex one (core router replacement, local dynamic routing protocol replacement, adding a second router with a secondary connection, etc.), and whether it impacts one device or 10+ devices. It must be prepared and reviewed in advance to avoid missing tests. Most of all, we should collect as much of the testing logs as possible and be able to compare the PRE/POST outputs to prove the network behaves the same after we finish the change as it did before we started. And sometimes these logs can help us with troubleshooting several days after the change...

If you’re interested in learning more about how IP Fabric’s platform can help you with analytics or intended network behavior reporting, contact us through our website, request a demo, follow this blog or sign up for our webinars.

One of the features included in IP Fabric's platform from the very beginning is the variability of network discovery inputs. First of all there is the Discovery Seed, which creates an entry point for discovery. Another functional feature is Filtering, which narrows down what to include in and what to exclude from discovery and analysis. But first things first: for a successful discovery we need proper licensing in place and valid authentication settings, which are not covered in this article.

IMG1: IP Fabric homepage snippet

Managing discovery seed

It's an optional feature, but it helps significantly with defining the network discovery starting point. If not set, the algorithm begins from the current default gateway, which in many cases is not the best scenario. In my lab, I know there's a subnet 10.241.255.0/24 that I'd like to discover. In the menu it can be found under Settings > Discovery Seed.

GIF1: Network Discovery Seed definition

With only this setup, IP Fabric was able to discover 24 devices in the lab, all within subnet 10.241.255.0/24. But we can also apply filtering. By navigating to Settings > Advanced, we can define the IP scope in greater detail by including or excluding networks or single IPs.

GIF2: Applying discovery filter

By applying the filter, I managed to discover only 20 network devices instead of 24, all within the 10.241.255.0/25 subnet only. More networks can easily be added to either the Discovery Seed or the subsequent filter. And there are even more interesting options that elevate IP Fabric's versatility. A few worth mentioning: authentication per subnet, defining the traceroute scope, or site separation types (defined by routing and switching domain, or by a RegEx based on hostname). Basic SSH and Telnet values can also be tuned, like the number of authentication failure retries or the login timeout value, and more.

IMG2: Finished discovery for 10.241.255.0/25


To begin with, IP network gateway redundancy has become a very standard high availability solution. In IP Fabric's platform it is described as the First Hop Redundancy Protocol (FHRP) feature, which currently umbrellas the HSRP, VRRP and GLBP protocols.

I have prepared a little scenario in our virtual lab with a LAN switched network protected by a firewall, which forwards all traffic out via a default static route towards a virtual VRRP gateway. The virtual gateway is created with the help of two Juniper boxes, a vMX and a vSRX in packet mode with no security policies defined (so it functions as a router).

From lab theory to discovery practice

After a successful discovery with IP Fabric version 2.2.5, we will confirm the correct VRRP setup by using diagrams and by slightly modifying our view.

Gif1: FHRP discovery in diagrams

There's a simple ring topology consisting of 6 routers in total (3 of them are SRXs, correctly considered as devices with firewall capabilities, specifically static1r16, static1r17 and static1r18, but all act as routers). The other three routers in the ring are vMX routers.

First, I would like to verify that VRRP is operational and that the correct virtual gateway is active. In the IP Fabric diagram Options panel we check 'Show FHRP', and if everything is configured properly, both gateways pop up on the diagram. Afterwards, by simply clicking the yellow FHRP button in the diagram, we discover the further VRRP setup, which seems to be correct. Router static1r18 is supposed to be the primary gateway: it's closer to the exit points and it has greater link capacity with an LACP interface compared to router static1r17.

Verify static routing

We are currently seeing only the IP-related connections between routers; there's a single-area OSPF as the primary IGP, with the static routing already mentioned above (originating from static1fw56–1, pointing towards the VRRP gateway). Well, let's verify that.

Gif2: Static routing verification

There are more ways to obtain routing information in IP Fabric, but we can go directly from our current diagram. We simply click the firewall node on the screen and all the information is there. In the routes panel we filtered the routes for 'S' (static routes). We verified the next-hop IP and interface and the route metric, and it seems we are good to go!

