OpenTelemetry (OTel) is an open-source observability framework that has been under development for four years and recently received a boost from F5 Inc. and ServiceNow. Cited as a project that will help advance the cause of observability, it will be interesting to see where and how it develops over the coming months and years.
In the meantime, it got us thinking about network telemetry in general. What exactly is it? Is it related to network monitoring? Most importantly, how does it relate to Network Assurance?
Network telemetry concerns the collection, measurement and analysis of data related to the behavior and performance of the network. Telemetry gathers information on routers, switches, servers and applications to gain insights into how they function, and how data moves through them. To provide an analogy, network telemetry checks the network's pulse to track health and performance.
Hang on a second... This sounds an awful lot like network monitoring. Well, kind of yes.
Both monitoring and telemetry platforms provide information on network health and performance in real- (or near real-) time.
Monitoring is the broader term: the overall observation of a network's performance, health and activity against pre-defined parameters. Telemetry refers to a more automated, continuous data-collection process focused on more granular data, such as packet loss and latency.
The data collected by both monitoring and telemetry platforms is essentially the same, with the main distinction being that telemetry focuses on more granular details and data. Another difference is in how the data is presented in their respective platforms.
Network monitoring relies on a process called "polling", in which monitoring tools periodically request data from network devices. This can unfortunately create visibility blind spots: issues that occur and resolve between polling intervals can go entirely undetected.
Furthermore, polling can potentially disrupt activity on the resource being monitored. Take, for example, a router passing traffic along an end-to-end path: when a query asks that router for 1,000 data points, each one is requested and extracted individually, and the router sends 1,000 separate responses in return. This is clearly an inefficient process.
Inefficient, and worse: the queries themselves create additional traffic within the network path, and they can impact the behavior of the very device the data is being gathered from. Back to our example: answering the monitoring queries requires the router to interrupt its current activity of passing traffic. Routers can't multi-task as well as some of us.
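To make the inefficiency concrete, here is a minimal sketch of the polling pattern described above. The `Router` class and its counter names are invented for illustration; the point is simply that each of the 1,000 data points costs its own request/response exchange, and the device must service every one.

```python
# Minimal sketch of polling-style monitoring (hypothetical device model).
class Router:
    def __init__(self):
        # Invented counter names standing in for per-interface statistics.
        self.counters = {f"ifOutOctets.{i}": i * 1000 for i in range(1000)}

    def handle_query(self, oid):
        # The router must service each management request before
        # returning to its primary job of forwarding traffic.
        return self.counters[oid]

def poll(router, oids):
    """Classic polling: one request/response round trip per data point."""
    results = {}
    for oid in oids:  # 1000 data points -> 1000 individual exchanges
        results[oid] = router.handle_query(oid)
    return results

router = Router()
data = poll(router, list(router.counters))
print(len(data))  # 1000 responses for 1000 data points
```

In a real deployment each `handle_query` call would be a network round trip (e.g. an SNMP GetRequest), so the per-data-point cost is far higher than this in-process sketch suggests.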
Network telemetry, conversely, does not suffer from this issue. Telemetry focuses on the continuous and real-time collection of network data through streaming, not polling. An engineer simply needs to tell the network device what data they're interested in, and then as the device operates, data is streamed to the telemetry platform. This means there's less chance of impacting the performance of the queried device, as it doesn't have to stop passing traffic to handle management requests - ultimately leading to quicker detection of issues and their resolution.
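The contrast with the polling sketch can be shown the same way. Below is a toy push model, with an invented `StreamingRouter` class: the engineer subscribes once to the data they care about, and the device streams samples as it operates rather than answering individual queries.

```python
import random

class StreamingRouter:
    """Hypothetical device that pushes metrics instead of answering polls."""
    def __init__(self):
        self.subscriptions = []

    def subscribe(self, metric, callback):
        # One-time setup: tell the device what data we're interested in.
        self.subscriptions.append((metric, callback))

    def tick(self):
        # As the device operates, it streams samples to every subscriber
        # without pausing its forwarding work to answer queries.
        for metric, callback in self.subscriptions:
            callback(metric, random.uniform(0.0, 5.0))

received = []
router = StreamingRouter()
router.subscribe("latency_ms", lambda metric, value: received.append((metric, value)))

for _ in range(3):  # three operating intervals -> three pushed samples
    router.tick()
print(len(received))  # 3
```

Real streaming telemetry (gNMI subscriptions, for instance) works on this subscribe-then-receive principle, though with structured data models rather than a plain callback.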
How does telemetry fit into the concept of network assurance?
When running a monitoring or telemetry platform without network assurance, you need a knowledgeable and experienced engineer to sit down and understand the information presented by these platforms. They have to manually determine what data is important and what to do when it goes over a threshold or boundary that they have set. Without this engineer spending a considerable amount of time setting up these thresholds, a monitoring or telemetry platform will just collect data and send alerts without context - creating alert fatigue for the rest of the engineering team.
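A minimal sketch of that context-free alerting, with invented metric names and hand-set thresholds, shows the problem: every breach fires an alert, with no indication of whether it actually matters.

```python
# Context-free threshold alerting: thresholds set by hand by an engineer.
THRESHOLDS = {"packet_loss_pct": 1.0, "latency_ms": 50.0}

def check(samples):
    """Fire an alert for every threshold breach - no topology, no impact
    analysis, just 'metric over limit'."""
    alerts = []
    for metric, value in samples:
        limit = THRESHOLDS.get(metric)
        if limit is not None and value > limit:
            alerts.append(f"ALERT: {metric}={value} exceeds {limit}")
    return alerts

samples = [("packet_loss_pct", 2.5), ("latency_ms", 12.0),
           ("packet_loss_pct", 1.8), ("latency_ms", 80.0)]
for alert in check(samples):
    print(alert)  # three context-free alerts from four samples
```

Multiply this across hundreds of devices and metrics and the alert-fatigue problem becomes obvious: the platform reports breaches, not consequences.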
Assurance perfectly fits this gap by providing the necessary context that teams need when an alert is sent from a telemetry platform.
Take the example of packet loss or latency. Network telemetry platforms will collect this information and make it available to engineers. Now they know there's an issue. What they don't know, however, is how this packet loss may affect the path or the wider network. With network assurance, teams can develop topologies of their network and data-flow diagrams to identify the end-to-end path that contains the packet loss. Based on this, they can understand the severity of the issue, where it stems from, where it can lead, and most importantly, how they can address it before it develops into a wider issue and begins to affect performance in other devices.
To adapt the phrase we used in our previous article on IP Fabric and Centreon - telemetry goes wide, and assurance goes deep.