Mind the gaps in your network tooling!

Intent Verification | Network Automation

Daren Fulwell

6 minute read

November 30, 2020

Updated: October 27, 2023


The network is a distributed system whose raison d'être is to deliver applications to your users in a reliable and timely way. To keep your network up and functioning, manageable and supportable, adaptable and secure, you need to maintain a set of tools that take care of the myriad different elements and vendors' platforms.

But how do you know the tools don't have conflicting views of the network devices they are responsible for? Are you sure they are all kept up to date and that the information in them is accurate? Are there deeper questions about the network that your tools simply can't answer?

IP Fabric can be used to enrich all of your network operations tooling. In this first of two parts, we'll take a look at just what comprises that ecosystem - and where the gaps are that IP Fabric can fill!

The tooling ecosystem

In order to ensure your network is maintained and well supported, you need to make use of tooling in a number of areas:

Configuration
Performance Monitoring
Event Management
IT Service Management
CMDB

Each of these areas can break down further into toolsets used by a range of different folks in order to maintain network service.

Configuration

Even in these days where network automation is de rigueur, a large majority of configuration changes are still made manually through device command line interfaces (CLI). It is typical, though, to use Configuration Management tooling to at least back up that configuration when changes are made. This allows simple replacement of a device should one fail, and the ability to track change activity over time.
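
As a minimal sketch of tracking change activity from backed-up configuration, the snippet below compares two snapshots of one device's configuration using Python's standard difflib module. The snapshot contents and the function name config_changes are hypothetical, made up for illustration; a real Configuration Management tool will store and compare backups in its own way.

```python
import difflib

def config_changes(before: str, after: str) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines between two snapshots."""
    delta = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in delta if line.startswith(("+ ", "- "))]

# Hypothetical backups of the same device taken on two different days.
backup_monday = "hostname edge-1\ninterface Gi0/1\n description uplink\n"
backup_friday = "hostname edge-1\ninterface Gi0/1\n description uplink to core\n"

for change in config_changes(backup_monday, backup_friday):
    print(change)
```

Run against a series of dated backups, the same comparison gives a simple change history for each device.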

Over time though, configuration of network devices is becoming more automated, typically through use of:

Scripting - use of shell scripts, programmes written in a language such as Python, or platforms like Ansible to carry out bulk configuration changes and test them;
Software-Defined Networking - where a controller looks after configuration of individual devices, and you centrally administer template configuration and variables via API or GUI;
Policy Engines - where individual devices are configured to retrieve policy information from a central engine in order to render active configuration;
Orchestration platforms - which allow us to combine automation elements into workflows to achieve more complex tasks across a number of configuration domains.

In order to get a full view, multiple tools may be required across a network, due to the different network domains and selection of vendors within an environment. Ideally they should all be synchronised in some form to ensure complete visibility and coverage end-to-end.

Performance Monitoring

Once a network has been deployed, it needs to be monitored - the performance of its components tracked and reported on - so that capacity limits aren't reached and performance thresholds aren't breached. Monitoring tools typically address three main areas:

Device - tracked typically using SNMP, the goal is to ensure that we quantify the health of the devices using such measurements as CPU and memory utilisation, internal temperature, and indeed whether the management plane of the device is even reachable!
Link - again, typically our monitoring platform will sample interface state over time and generate a utilisation measurement for the interfaces in question. These figures will typically be graphed and trends predicted over longer time periods to help with capacity planning.
Application - typically using one of two methods. Either synthetic probes are sent into the network to simulate the behaviour of an application (perhaps an HTTP request to a given URL) and measurements are taken, or actual traffic flows (captured using NetFlow, sFlow or similar) are analysed and measured to give a realistic view of application performance.
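
To illustrate the link measurement described above, here is a rough sketch of how a utilisation figure can be derived from two samples of an interface octet counter (for example an SNMP counter such as ifHCInOctets). The function name, parameters and sample values are assumptions for illustration only; real monitoring platforms handle polling, counter resets and averaging in their own ways.

```python
def link_utilisation(octets_start: int, octets_end: int,
                     interval_s: float, speed_bps: float,
                     counter_bits: int = 64) -> float:
    """Percentage utilisation of a link between two octet-counter samples."""
    delta = octets_end - octets_start
    if delta < 0:
        # The counter wrapped past its maximum value between the two samples.
        delta += 2 ** counter_bits
    bits_transferred = delta * 8
    return 100.0 * bits_transferred / (interval_s * speed_bps)

# Hypothetical samples taken 300 seconds apart on a 1 Gbit/s interface.
print(round(link_utilisation(1_000_000_000, 4_750_000_000, 300, 1e9), 1))  # 10.0
```

Graphing this value for each polling interval gives the trend line used for capacity planning.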

More modern approaches to gathering performance data include streaming telemetry. A data collection platform subscribes to certain types of performance data from a publisher (typically a network device or controller). The data is then sent directly without a need to poll for it, meaning that it arrives at the collector in something approximating real time.

Performance monitoring can only give a view of individual devices, links or applications. Again, multiple tools may be required due to mixes of vendors and capabilities, and it may be difficult to offer a meaningful aggregated view. Maintaining end-to-end visibility and consistency is once more key.

Event Management

Alongside performance monitoring, systems to collect information about events occurring in the network are key, in order to provide a responsive service. There are a number of mechanisms for collection of event data, including:

SNMP traps - a "traditional" approach whereby if a device detects an event which needs to be notified, it sends a "trap" to a collector. The trap is logged and displayed on a dashboard.
Syslog - another well-worn mechanism, used by all elements of IT infrastructure whereby a system creates log entries for events that occur on a device. Those entries can either be stored locally or sent across the network to a collector.
Webhooks - a newer approach, typically used by API-driven infrastructure platforms or controllers. If an event occurs in the device or system, an HTTP request is sent to a given web server which queues up the notifications. A handler can then pull webhook notifications from the queue and act on them accordingly, typically using a script to pull more detail about the event from the device or system concerned.
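
The webhook flow above (receive, queue, handle, enrich) might be sketched as follows. This is an illustration only: the payload fields, the function names and the fetch_detail stand-in are all hypothetical, and the web server itself is left out so that the example stays self-contained.

```python
import json
import queue

# Notifications wait here between being received and being handled.
notifications = queue.Queue()

def receive_webhook(body: bytes) -> None:
    """What the web server would do with the body of each incoming HTTP request."""
    notifications.put(json.loads(body))

def fetch_detail(device: str, event_id: str) -> dict:
    """Stand-in for an API call back to the device or controller (hypothetical)."""
    return {"device": device, "event_id": event_id, "detail": "interface Gi0/1 down"}

def handle_pending() -> list[dict]:
    """Pull every queued notification and enrich it with more detail."""
    handled = []
    while True:
        try:
            event = notifications.get_nowait()
        except queue.Empty:
            break
        handled.append(fetch_detail(event["device"], event["id"]))
    return handled

receive_webhook(b'{"device": "edge-1", "id": "evt-42"}')
print(handle_pending())
```

Separating receipt from handling means a burst of notifications is queued rather than lost while the handler is busy.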

From a network-wide perspective, the key element of event management is Correlation. An incident in the network has the potential to generate events on a number of devices and as such may appear as a number of SNMP traps, log entries or webhooks in the collection engine. An Event Correlation engine (such as a SIEM - Security Information and Event Management platform) will combine those notifications and attempt to derive a root cause from them.
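
As a deliberately simplified sketch of correlation, the example below groups events that arrive close together in time and treats the earliest event in each group as a candidate root cause. This time-window heuristic is an assumption chosen for brevity; real correlation engines also draw on topology, rules and other context. The Event fields, device names and messages are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float  # seconds since an arbitrary start
    device: str
    message: str

def correlate(events: list[Event], window_s: float = 30.0) -> list[list[Event]]:
    """Group events so that each one falls within window_s of the previous one."""
    incidents: list[list[Event]] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if incidents and event.timestamp - incidents[-1][-1].timestamp <= window_s:
            incidents[-1].append(event)
        else:
            incidents.append([event])
    return incidents

events = [
    Event(100.0, "core-1", "link down Gi0/1"),
    Event(101.5, "edge-1", "OSPF neighbour lost"),
    Event(102.0, "edge-2", "OSPF neighbour lost"),
    Event(900.0, "edge-3", "fan failure"),
]
for incident in correlate(events):
    first = incident[0]
    print(f"{len(incident)} event(s), earliest from {first.device}: {first.message}")
```

Here the three events within a couple of seconds of each other collapse into one incident, with the link failure on core-1 as the likely trigger.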

It is important to be sure that all elements in the network are contributing event data, particularly when you are using correlation to determine the root cause of issues. Traditional tooling doesn't verify this; we need to combine a picture of what is really in the network with a check of whether each element is correctly configured.
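
One way to picture that check: compare the set of devices actually in the network with the set of devices the event collector has heard from. The sketch below assumes both lists are already available as hostnames; the names and function are hypothetical.

```python
def event_coverage(inventory: set[str], event_sources: set[str]) -> dict[str, list[str]]:
    """Compare what is in the network with what is sending event data."""
    return {
        # In the network, but the collector has never heard from them.
        "silent": sorted(inventory - event_sources),
        # Sending events, but not present in the inventory.
        "unknown": sorted(event_sources - inventory),
    }

# Hypothetical hostnames.
inventory = {"core-1", "core-2", "edge-1", "edge-2"}
event_sources = {"core-1", "edge-1", "edge-2", "lab-sw-9"}

print(event_coverage(inventory, event_sources))
# {'silent': ['core-2'], 'unknown': ['lab-sw-9']}
```

A silent device means correlation is working from an incomplete picture, whatever the dashboard shows.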

IT Service Management

Books have been written on this topic alone, so I shan't delve too deeply, but there are a few key areas in service management which really impact day-to-day network operations. For example:

Incident Management - this is the collection of processes which a service desk uses to triage, analyse and troubleshoot issues as they arise. Event Management and Performance Monitoring tooling would be used to proactively trigger these processes. Typically, manual incident tickets are raised by users when they are faced with an issue and so the ticketing systems drive the processes.
Problem Management - when a collection of incidents can be grouped to indicate that there is a larger scale (or longer term) problem. These processes rely on the same tooling as Incident Management, but also heavily on documentation, project delivery and so on.
Change Management - when changes are required to be made to the network, it is necessary to assess impact both of making and not making the changes. Documentation is leaned on heavily here, along with the experiences of the network engineers. It is also key to understand the dependency of applications on the different elements of the network.

It is clear that ITSM processes benefit from as much information as possible about how application services interact with the infrastructure that supports them - information that is difficult to derive from simple monitoring platforms. Documentation also needs to be kept updated to support pre- and post-change validation.
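
Pre- and post-change validation can be thought of as comparing two snapshots of network state and reviewing anything that differs. The sketch below assumes the snapshots have already been gathered into simple dictionaries; the keys, values and function name are hypothetical.

```python
def state_changes(pre: dict, post: dict) -> dict:
    """Map each key whose value differs to its (pre, post) pair."""
    return {
        key: (pre.get(key), post.get(key))
        for key in sorted(pre.keys() | post.keys())
        if pre.get(key) != post.get(key)
    }

# Hypothetical state captured before and after a change window.
pre_change = {"routes": 1204, "bgp_peers_up": 4, "Gi0/1": "up"}
post_change = {"routes": 1204, "bgp_peers_up": 3, "Gi0/1": "up", "Gi0/2": "up"}

print(state_changes(pre_change, post_change))
# {'Gi0/2': (None, 'up'), 'bgp_peers_up': (4, 3)}
```

The new interface is an expected result of the change; the lost BGP peer is the kind of side effect the comparison is there to catch.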

CMDB

The CMDB - or Configuration Management Database - is a bit of a misnomer as far as network support is concerned. It isn't concerned with management of device configurations, but with being a definitive inventory of all the items which make up the network infrastructure. It is used by all areas of the business which need access to and visibility of that information, and includes:

Physical inventory - the devices, modules and parts themselves, serial numbers and locations.
Software licence data - related to which devices, feature sets, duration and expiry, costs.
Support contract information - such as vendor, serial numbers covered, level of coverage (e.g. 9x5 same business day replacement, 24x7 four-hour replacement with installation), period of cover.
Circuit rental - service provider, A and B end locations, support coverage, annual charges, duration, cancellation information.

These elements are not directly related to the operation of the network infrastructure itself (although licensing is becoming more so) but are key to ensuring that support runs smoothly over time. Too often manual processes are used to keep data in sync with these operational platforms.
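
To show what an automated alternative to those manual processes could look like, here is a small sketch that reconciles CMDB records against what has been discovered in the live network, keyed by hostname and serial number. The data, field choice and function name are hypothetical; a real CMDB would be queried through its own API.

```python
def reconcile(cmdb: dict[str, str], discovered: dict[str, str]) -> dict[str, list[str]]:
    """Compare hostname -> serial number maps from the CMDB and the live network."""
    return {
        "missing_from_network": sorted(cmdb.keys() - discovered.keys()),
        "missing_from_cmdb": sorted(discovered.keys() - cmdb.keys()),
        "serial_mismatch": sorted(
            host for host in cmdb.keys() & discovered.keys()
            if cmdb[host] != discovered[host]
        ),
    }

# Hypothetical records.
cmdb_records = {"core-1": "FOC1111", "edge-1": "FOC2222", "edge-2": "FOC3333"}
discovered = {"core-1": "FOC1111", "edge-1": "FOC9999", "edge-3": "FOC4444"}

print(reconcile(cmdb_records, discovered))
```

A serial mismatch often points to a replaced device whose support contract or licence record was never updated.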

Next Time

You can see that in order to operate networks, a whole web of tooling is required to maintain a good level of visibility and control. Importantly, the significant gaps between them need to be filled. In Part Two of this mini-series, we'll show how IP Fabric can be used to answer questions that you simply can't answer with the traditional tooling we describe here. And we will look at how we can enrich the data in those tools with information from IP Fabric!

