Every network is growing, every network is becoming more complex due to all the requirements coming from the technology evolution or customer needs, e.g. building the more reliable network connections, application-based routing, etc. And all these facts are the triggers for all the changes we are facing almost every day. Of course, there are small (not so important changes) but there are also complex network changes including core devices replacement, design changes impacting the business, and emergency changes that are caused by a failure …

I was managing small networks of units of devices, larger networks consisted of tens or hundreds of network devices but also the enterprise networks with thousands of routers, switches, firewalls and many other different network appliances and devices. I want to share some of my best practices from experience with all the changes I had prepared and executed.

Network change process

There should be a reason for every network change, but let’s exclude the emergency changes as it’s mostly about “how to fix the network so that the business can go on ASAP”. I want to focus on the planned changes for which we can prepare for properly — the best way is to have the change management process in place, including the CAB (Change Approval Board) that consists of technical reviewers and these are able to check the whole change preparation with the proper description, commands and devices to be implemented or installed, and also check the testing sequence that is needed to prove that the change is successful.

Ideal way is to follow these point to ensure we know what to do before we touch the network.

  • prepare all the commands that needed to be executed during the change window (usually with approved downtime) and all the important information about the devices we touch during the change (hostname, dns name, management IP address etc.)
  • prepare all the testing including the pre-checks (they’re executed before we start the change), continuous-checks to ensure everything is working as expected, and also the post-checks to be able to compare with the pre-checks and prove the network change is successful and to have a possibility to check the behavior of the changed part of network later (during the troubleshooting — next day, week after the change etc.)
  • prepare the rollback plan to be able to revert the whole configuration in case of any issue that is too hard to fix in time

In case of the complex change it is better to divide the whole activity into the smaller pieces and prepare the tests for all these sub-tasks. It should avoid the complex troubleshooting during the maintenance window and review the commands we applied from the very beginning of the change …

Small change

Let’s imagine we need to implement small change caused by the change of the ISP’s PtP network on our MPLS connection. It means the only commands we need to implement is to modify the IP address on the WAN interface and modify the next-hop within our static route. So what to check? What kind of testing method should we choose?

I recommend to create a set of verification commands so they’re not based on a specific change. Just to have a possibility to check the health of the device, do a snapshot of the different tables to have a chance to prove there is some issue right before we start the change — e.g. we don’t have a connection to the corporate FTP server, the external connectivity is not working, the log shows some issue not related to our change etc. Based on the device type we should gather the information about the uptime, last reload reason, the logs from the device, NTP status, routing table, mac-address table, interface stats, ARP table, multicast table(s), etc. Then start with the verification the subject of the change — using ping and traceroute is the most usual way to check the routing — ping the ISP IP, ping any other off-site resource, ping some Internet IP (personally, I’m using the public DNS IPs — e.g. 8.8.8.8, 4.4.2.2 …) and also trace to these IPs. Then, check the specific command related to this change — interface configuration and default route information.

And most important fact is the reason why we are configuring the network services — it’s not just for fun, just to have the router/switch/firewall with the Internet connection, but mostly because it’s serving the customer to have a set of services they need for the business (except the lab environment). Therefore the most critical testing is the application testing procedure from the customer side and it should be executed directly by the customer or onsite user to confirm it’s working as expected. For basic pre-check see below

!
show clock
show version
show run
!
show ip int brie
show int description
show int e0/0
show int e0/0.3733
show ip route
show ip route 0.0.0.0
sh ip proto sum
sh ip ospf
sh ip ospf neighbors
sh ip ospf data
sh ip nat translation
sh ntp status
show log
!
!!! ping the ISP interface (old and new)
!
ping 10.73.254.1 rep 100 size 1024
ping 10.73.200.1 rep 100 size 1024
!
!!! ping the off-site resource
!
ping 10.0.20.20
!
!!! traceroute to the off-site resource
!
traceroute 10.0.20.20 prob 1 time 1 num
!
!!! ping the Internet IP
!
ping 8.8.8.8
!

The change activity should start with running the verification commands, then change the IP and default route in this case and running the same verification command set. I’m asking customer to run their own verification after I’m done with my portion of verification to avoid some high traffic load etc. Then, using some comparison tool to review the PRE and POST states:

From the network perspective — the change has been implemented successfully, all the checks are as expected, the routing is working fine and we should ask the customer to do their tests — their own pings, traceroutes, application testing — e.g. working with the emails, VoIP services, core apps like SAP, web services, FTP services, etc.

Complex change

Let’s review all the steps required to proceed with this simple change — change one IP address. Analyze the risk and impact of that change, prepare the verification procedure, prepare the commands needed to implement the change, run all the checks, apply all the commands, run all the checks once again and compare the results. And now, add the some complexity to the trivial network change.

What about if the replacement of the core network device is needed? It means more preparation work and also more attention to the whole process … what about if the site-local dynamic routing protocol is planned to be replaced, e.g. replace the old RIP with the newer OSPF. It’s not about one step, about one or two commands and one or two ping/traceroute verification. I will go through the previously described process more deeply and will bring some additional details to each step.

1. Prepare all the commands

We should be prepared as much as possible and should have all the commands we will apply on all various devices to save the time during the change window. Many complex changes have to be planned very carefully and it is better to split the whole change to several stages with the specific command set to be applied

We should be prepared as much as possible and should have all the commands we will apply on all various devices to save the time during the change window. Many complex changes have to be planned very carefully and it is better to split the whole change to several stages with the specific command set to be applied

These “big actions” are mostly coordinated with the customer therefore it is good to define some milestones within the implementation plan and discuss the progress among the whole team. I will cover the rollback plan later in this chapter.

It is a definitely most critical part of that change process. Not from the functional point of view but from the change management process perspective. It must be proven the network is working at 100% and all the types of communications have to be tested and all the parameters verified.

It is still very important to have all the information we might need in one place — IP addressing of all devices, access credentials incl. local user access, topology diagrams, etc. It is also important to double check the order of the commands we will apply line by line — we must be very careful not to apply any command it will cut the device off. Personally, I caused it many times in the past — I applied the wrong access-list, I didn’t allow the correct VLAN on the uplink trunk, I configured the wrong IP address on the WAN interface, and did many other things… so if it is possible, it is good to use scheduled rollback or reload of the device to revert back the last known configuration (e.g. Cisco’s command reload in) …

2. Prepare all the verifications

It is a definitely most critical part of that change process. Not from the functional point of view but from the change management process perspective. It must be proven the network is working at 100% and all the types of communications have to be tested and all the parameters verified.

The big checks (PRE/POST and after the specific stage) should include the global verification — availability, path parameters and should be automated (will be covered later in text) and the small verification boxes represent just “local” testing after the specific commands are executed (e.g. verify the routing protocol is up and running, the ACL matches the trafrfic, the QoS marking is working etc.).

Most of the network devices we are configuring have two sides — the firewall is using inside and outside interfaces, the router is using WAN and LAN interfaces, the switch is connecting the hosts or other switches towards the core etc. Based on this fact, I recommend to use three different types of testing method:

a) checking connectivity TO THE DEVICE

It means we should check the availability of the configured device from the core or from the downstream direction. We can check the online/offline state with the ping tool and we can check the incoming direction with the traceroute command. The IP address can be the management loopback, WAN IP address or any other IP important for the communication with this device.

b) checking connectivity FROM THE DEVICE

It is very important to have a list of corporate resources that should be available throughout the whole network — FTP server, NTP server, VPN hub, … and the best way how to check this part is to ping all these resources (IPs) and trace the path to them to ensure it’s working as expected.

c) checking connectivity THROUGH THE DEVICE

It may happen that the network is not working even the previous two checks are OK. Why? Because the traffic shouldn’t pass the device correctly — there could be a problem with the advertisement of the local prefixes, the issue with the redistribution, the L2 switch VLAN ACL or mismatched filtering within the FW ACL. So pinging the core or corporate resources from the LAN (or inside) AND pinging the LAN/inside IPs from the core is a must in this case to ensure both directions are OK.

3. Prepare the rollback plan

As mentioned before, milestones or check points should be defined to have a possibility to decide whether to continue or not with the network change (go-no-go decision). They are usually after verification at every stage:

These three milestones from our example can be described as follows:

  • Milestone 1 — all the involved people are ready to proceed, incl. onsite support, remote support, PM if needed etc. All the devices are reachable, downtime was approved and customer has agreed with the start of that change.
  • Milestone 2 — all the post stage checks are successful, all the applications are working fine (if applicable) and current time is according the planed Time T2. If the change doesn’t go smoothly and the team is under the time pressure, it is standard part of the change process to discuss the next steps — to continue and try to catch the next milestone in time, or to roll back and revert all the configuration to the starting position and cancel the change.
  • Milestone 3 — all the post change checks are successful and all the team members agree on the result of the change. If there is any kind of issue, the change executor should be able to proceed with the rollback plan and revert back the configuration to the original state (before the change started) or provide any kind of troubleshooting if the current time is less than Time T3.

The rollback plan should contain the steps for all the devices that are being touched during the change window — re-cabling instructions and command set for each milestone and its own verification procedure. The global post-check verification is executed after the rollback is completed and comparison with the pre-check output is mandatory.

Verification tasks automation

The amount of the verification is growing with the complexity of the change and therefore we should try to save the energy by using some automation tools. I omit the commercial tools for monitoring the network resources and want to describe some various possibilities I’ve encountered during my work.

I will use just three verification commands they’re implemented on all OS I have met — ping to check if the destination IP is available or not, traceroute to check the exact path between the source and the destination IP and telnet to check various L4 ports. If I use the same differentiation of the verification directons as described above, I will talk about only two directions from that list only — verification executed on the device that is being changed and from any other location towards this device.

1. Verification from the impacted device

It’s easier to execute the verification procedure from the impacted device as we are still at one prompt and only challenge is to process the command by command.

Cisco’s easiest way is to use the TCL script to run the verification commands by using native commands mentioned above. It can be stored as a function and called any time during the change window or it can be processed just once as we need it.

Let’s demonstrate how it works and the difference between the various possibilities. Let’s have three IP addresses we will use for our verification from one device — 10.0.20.1, 10.0.20.20 and 10.0.20.21. I have prepared the set of verification commands:

!
ping 10.0.20.1 rep 2
ping 10.0.20.20 rep 2
ping 10.0.20.21 rep 2
!
traceroute 10.0.20.1 prob 1 time 1 num
traceroute 10.0.20.20 prob 1 time 1 num
traceroute 10.0.20.21 prob 1 time 1 num
!

Of course I can use standard copy&paste function but the problem is processing the input via the TTY session. Next exhibit shows the result when using copy&paste — it can be seen one ping command is missing (even it was within the pasted text) and two traceroute commands are also missing …

L73EXR1#ping 10.0.20.1 rep 2
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 10.0.20.1, timeout is 2 seconds:
!!
Success rate is 100 percent (2/2), round-trip min/avg/max = 4/4/5 ms
L73EXR1#ping 10.0.20.20 rep 2
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 10.0.20.20, timeout is 2 seconds:
!!
Success rate is 100 percent (2/2), round-trip min/avg/max = 5/11/18 ms
L73EXR1#traceroute 10.0.20.21 prob 1 time 1 num
Type escape sequence to abort.
Tracing the route to 10.0.20.21
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 1 msec
3 10.31.254.2 [AS 21] 3 msec
4 *
5 10.31.47.115 [AS 31] 2 msec
6 10.31.49.117 [AS 31] 1 msec
7 10.0.11.254 [AS 31] 4 msec
8 10.0.20.21 [AS 31] 5 msec

We need ensure that every command is pasted/processed once the prompt is ready. With Cisco TCL script we can modify the verification set as follows:

foreach address {
10.0.20.1
10.0.20.20
10.0.20.21} { ping $address rep 3
}

It means we can enter the TCL mode by typing tclsh command and paste the text above. It will process command by command as needed. Then, we can expand the TCL script and add the traceroute or any other Cisco command with the IP address as an argument. To make it simpler, it’s possible to create the procedure called any name, put all the verification into the brackets and then, it can be called any time from the tclsh prompt. The example of our verification set:

proc CheckFROMonce {args} {
foreach address {
10.0.20.1
10.0.20.20
10.0.20.21} { ping $address rep 3
}
foreach address {
10.0.20.1
10.0.20.20
10.0.20.21} { trace $address prob 1 time 1 num
}
}

Then, let’s call the procedure once:

L73EXR1(tcl)#CheckFROMonce
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 10.0.20.1, timeout is 2 seconds:
!!!
Success rate is 100 percent (3/3), round-trip min/avg/max = 4/6/8 ms
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 10.0.20.20, timeout is 2 seconds:
!!!
Success rate is 100 percent (3/3), round-trip min/avg/max = 2/8/14 ms
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 10.0.20.21, timeout is 2 seconds:
!!!
Success rate is 100 percent (3/3), round-trip min/avg/max = 2/5/10 ms
Type escape sequence to abort.
Tracing the route to 10.0.20.1
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 3 msec
3 10.31.254.2 [AS 21] 3 msec
4 *
5 10.31.47.115 [AS 31] 7 msec
6 10.31.49.117 [AS 31] 5 msec
7 10.0.11.254 [AS 31] 5 msec
8 10.0.20.1 [AS 31] 4 msec
Type escape sequence to abort.
Tracing the route to 10.0.20.20
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 1 msec
3 10.31.254.2 [AS 21] 7 msec
4 *
5 10.31.47.115 [AS 31] 7 msec
6 10.31.49.117 [AS 31] 6 msec
7 10.0.11.254 [AS 31] 9 msec
8 10.0.20.20 [AS 31] 3 msec
Type escape sequence to abort.
Tracing the route to 10.0.20.21
VRF info: (vrf in name/id, vrf out name/id)
1 10.73.254.1 1 msec
2 10.31.254.1 [AS 21] [MPLS: Label 129 Exp 0] 2 msec
3 10.31.254.2 [AS 21] 5 msec
4 *
5 10.31.47.115 [AS 31] 2 msec
6 10.31.49.117 [AS 31] 2 msec
7 10.0.11.254 [AS 31] 3 msec
8 10.0.20.21 [AS 31] 3 msec
L73EXR1(tcl)#

In case we cannot use the TCL or we don’t want to use it for any kind of reason, we can use own script based on the tool or scripting language we are using — it can be PERL or PYTHON script with direct SSH/Telnet module or it can be served by VBScript within the SecureCRT client. The easiest VBScript without loading external files (e.g. device list or command set) is as follows:

#$language = "VBScript"
#$interface = "1.0"
crt.Screen.Synchronous = True
sub Main
  crt.Session.Log False
crt.Session.LogFileName = "logs\CheckFROMonce.log"
crt.Session.Log True
  crt.Screen.Synchronous = True
  crt.Screen.Send "ping 10.0.20.1 rep 2" & vbcr
crt.Screen.WaitForString "# "
crt.Screen.Send "ping 10.0.20.20 rep 2" & vbcr
crt.Screen.WaitForString "# "
crt.Screen.Send "ping 10.0.20.21 rep 2" & vbcr
crt.Screen.WaitForString "# "
crt.Screen.Send "trace 10.0.20.1 prob 1 time 1 num" & vbcr
crt.Screen.WaitForString "# "
crt.Screen.Send "trace 10.0.20.20 prob 1 time 1 num" & vbcr
crt.Screen.WaitForString "# "
crt.Screen.Send "trace 10.0.20.21 prob 1 time 1 num" & vbcr
crt.Screen.WaitForString "# "
  If crt.Session.Logging Then      
crt.Session.Log False
End If
  crt.Sleep 500
MsgBox "DONE: Generic Scripting"
end Sub

There is a lot of options how to proceed with this type of task but the aim is the same for all methods — save the time by pasting the commands, avoid the human error by using one verification procedure in all the verification steps and to have the same-formatted log to be able to compare the results any time during or after the change.

2. Verification to or through the impacted device

A little bit different scenario is the second part of verification and it’s testing the function of the impacted device for the traffic that is destined to the device or is passing through the device.

To fulfill this task we should login to the various devices and execute some specific set of command. The command set can be equal to check the same connectivity check from various network sources or additionally, it can be different and test different services or destinations.

I recommend to include these verification:

  • Test connectivity from the different network sources to the management IP address of the impacted device (ping/trace).
  • Test connectivity from the different network sources located outside or WAN side and it’s destined to the LAN/inside IPs to check the advertisement of the local prefixes (ping/trace/telnet).
  • Test connectivity from the LAN/inside to the WAN/outside to the various network destinations to check the routing through that device.

The automation of these verification steps can be done by using some scripting language and preparing the specific device-list as an input file and executing the verification command set from these devices step by step.

The simple BASH script that should be located and executed somewhere in the network far from the change location could be used to monitor the local resources incl. the management IP, local prefixes etc.

#!/bin/bash
HOSTS="10.0.20.73 10.73.255.1"
C=2
while true
do
for conHost in $HOSTS
do
result=$(ping -c $C $conHost | grep 'received' | awk -F',' '{ print $2 }' | awk '{ print $1 }')
if [ $result -eq 0 ]; then
echo "Host $conHost is down! at $(date)"
fi
done
done

We can specify as many hosts as needed and check the connection state — if the ping result is OK (the IP is reachable), it’s showing no output and if there is an issue with the reachability, it will display the information the host is down and the date/time detail:

trehor@labremote:~$ ./monitor.sh  
Host 10.73.255.1 is down at Thu Jul 26 06:39:41 CEST 2018
Host 10.73.255.1 is down at Thu Jul 26 06:39:53 CEST 2018
Host 10.73.255.1 is down at Thu Jul 26 06:40:05 CEST 2018
^C
trehor@labremote:~$

Because of the infinite loop within the BASH script, you must terminate the monitoring script by pressing CTRL+C.

In my opinion, the verification is very important part of the network changes regardless it’s a simple network change (changing the next-hop of the static route) or very complex change (core router replacement, local dynamic routing protocol replacement, adding the second router with the secondary connection etc.) or impacting one device or including 10+ devices. It must be prepared and reviewed in advance to avoid missing tests and mostly, we should collect as much as possible from the testing logs and be able to compare the PRE/POST outputs to prove the network behavior is the same after we finish the network change as it was before we started. And sometimes — these logs can help us with the troubleshooting several days after the change …

If you’re interested in learning more about how IP Fabric’s platform can help you with analytics or intended network behavior reporting, contact us through our website, request a demo, follow this blog or sign up for our webinars.