
Stephen Collins: Welcome everyone to today's ONUG webinar. Our topic today is Unlocking AIOps with Red Hat and IP Fabric: The Power of Dependency Mapping. I'm your host, Stephen Collins. I am the ONUG CTO, and I'll be your moderator today.
We have two speakers. We have Martin Moucka, who's manager of IT Networks at Red Hat, and we have Vitezslav "Vitek" Savel, who's the Strategic Channels & Accounts Manager at IP Fabric.
We're going to have an interesting discussion today about how Red Hat has been working with IP Fabric to get a much better handle on what's going on inside the network infrastructure, respond better to issues, anomalies, and incidents within the network, and leverage this dependency mapping capability to be far more proactive and to automate workflows. It's a really interesting topic, and I think very timely given the state of where networks are today.
Before we dive in though, I want to let everybody know, feel free to ask any questions that may arise during the course of the next half hour or so. You can do that by entering your question via the Q&A tab at any time. We'll probably queue the questions up towards the end, but feel free to ask them at any time, especially if they're really relevant. Maybe we'll just stop and answer it right at that point.
That's a little bit of housekeeping. I just want to set the stage a little, since the topic is AIOps, and give some context on how we got here. The interesting thing is that the term AIOps was coined by Gartner way back in 2016. So in the industry, we've got almost a decade's worth of experience with AIOps solutions.
Back when this first came on the scene, a lot of the AIOps boiled down to machine learning using statistical analysis of time series data, usually processes that were focused on a given domain or a given layer within the network. Some fairly powerful tools were developed, but again, just really zeroing in on a given domain or a given layer.
Then neural networks became commercially viable. So we started doing more actual machine learning on data, and that allowed us not just to look backward at historical views of what was going on in the network, but to use those capabilities to make predictions. So there's been a lot of really interesting technology developed. And of course, the amount of data that can now be consumed and analyzed is mind-blowing. Tools have gotten nothing but more powerful at that atomic level.
But what's interesting is that as AIOps has progressed in its capabilities, the environments we're operating in have progressed in complexity and scale. Ten years ago, a lot of folks were probably just running applications within their own data center. Maybe they were starting to dabble with the cloud. If you look at where we are today, you've still got a lot going on in on-premises data centers, but there's also a whole lot going on in the cloud, and actually in multiple clouds and multiple SaaS applications.
So the whole environment has become much more distributed, much more dynamic, much more complex. It's been a challenge for AIOps techniques to evolve out of that private data center environment into this now highly complex and distributed environment with layers of virtualization.
The interesting thing is, what needs to be done to harness the power of AIOps in these environments? It really boils down to being able to understand and track the relationships between different elements—network elements, devices, elements of the application infrastructure—and the relationships between all those components, and also the relationships between them as they change over time. Some of that can be highly dynamic in these very virtualized application environments that are now becoming commonplace with things like Kubernetes and container-based applications.
We're in a world now where AIOps is not just a nice-to-have; it's a must-have. What we're going to talk about today with Martin and Vitek is a really important concept called dependency mapping, and how that can be used to keep track of the relationships between different components, elements, and layers or domains. And then just do a much better job of keeping track of what's going on, predicting what might be going on, and leveraging the power of the automation tools that are now at our disposal.
That's before we even start to apply things like generative AI and large language models to the problems that need to be solved. So we're really, I think, at another inflection point in the way networks and application environments are evolving. Today's topic is particularly relevant to all of that.
With that little bit of a long-winded introduction, I'm going to hand this over to Martin and Vitek and let them run with it.
Martin Moucka: Thank you, Steve. Let me build on the topic. Thank you for the introduction. What we are dealing with at this moment is really the AI paradox. The promise is huge—having proactive, self-healing IT that saves money and time. But eventually, we have a vast amount of data that is overwhelming. As you mentioned, Steve, it's very hard to navigate through the complex infrastructure of virtualization, containers, and systems being very distributed in multiple data centers—on-premises or even in the cloud.
Basically, the core problem at this moment is that teams lack real visibility into the infrastructure. They fail to understand how things are connected, what is dependent on what, and what happens when something breaks or when I do a change somewhere—what am I potentially breaking? That's the biggest question that we have currently.
AIOps is not a magic AI that you just plug into an infrastructure and it suddenly understands everything, makes all the decisions, and provides full end-to-end visibility. Eventually, you need to provide that context so you can augment your expertise with tooling and data for data-driven decisions, moving from reactive to proactive and eventually to automated remediation.
As I said, you need the context. You need a single source of truth that shows, in a machine-readable format, how things are actually connected, how things are dependent on each other. You have to go from the very bottom layer of infrastructure up to the application stack and individual application resources. One application is not just one VM running somewhere; it's multiple resources that are connecting to each other and exchanging data.
You need to have that context somewhere, and that is where the dependency map comes in. You cannot automate what you don't understand. Dependency mapping helps you connect everything, instantly understand and see the blast radius if there is an outage. So you can have fast or automated root cause analysis with today’s AI agents. With this context that you provide to the AI agent, it can pretty much automate the diagnosis, determine the most likely cause, and suggest or even trigger the appropriate remediation actions.
With all that context, you can have smarter automation where you are really hitting the right playbook for the right target and you get the automated remediation in no time. As I mentioned, you understand what you're potentially breaking when you do a change. So whenever you do change planning, you know your upstream, downstream relationships. You know who to notify, who to talk to. You have stakeholders that are dependent on your piece of infrastructure. So when you do that change, you can clearly articulate what the impact is and have that scheduled so nobody is affected—or you limit that impact.
How we actually started with building the dependency map—now that I talk so much about it—we have a lot of tooling, but we decided to start with a couple of tools. We have some data in ServiceNow, where we have the CMDB storing data about business applications and what forms these applications—VMs, OpenShift, Kubernetes resources, storage, and all that. We have that information, but it's not showing how it's connected down to the networking layer.
We have network automation. We automate the whole network infrastructure and we know how the infrastructure is connected. We have OpenShift, where we run virtualization and containers, and that can provide a lot of context by API. But again, it's not showing the full context. So we chose NetBox, an open-source tool, to be the aggregator of the information.
We wanted to put all the data—from network, ServiceNow, OpenShift, network automation—and connect it all together. But to achieve that, we needed the glue. And that's why we're partnering on this webinar with IP Fabric—because we used IP Fabric as the glue for the data, to connect the dots between application resources, virtualization workers, and the network infrastructure.
IP Fabric, as an automated network assurance platform, allowed us to pull data via API—like ARP, MAC, and other tables—and use that as the glue between what we had on one side and the other. So: application CMDB, network infrastructure, OpenShift. It resulted in a pretty simple script you can use—go to the API of IP Fabric and get all the data. Get the exact edge of where it's connected, what it's connected to, and connect it all into one picture.
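The "glue" idea Martin describes can be sketched in a few lines: join a CMDB's app-to-VM view with network-side ARP and MAC data to produce dependency edges. Everything here (field names, table shapes, sample data) is an illustrative assumption, not Red Hat's actual pipeline or the IP Fabric API schema.

```python
# Minimal sketch of the "glue" join: correlate a CMDB's app-to-VM view
# with network-side ARP/MAC-style data to produce dependency edges.
# All field names and sample data are illustrative assumptions.

# CMDB side: which VMs (by IP) belong to which business application.
cmdb = [
    {"app": "billing", "vm": "vm-101", "ip": "10.0.1.5"},
    {"app": "billing", "vm": "vm-102", "ip": "10.0.1.6"},
]

# Network side: ARP maps IP to MAC, and the MAC table maps MAC to the
# switch edge port where that address was learned.
arp = {"10.0.1.5": "aa:bb:cc:00:00:01", "10.0.1.6": "aa:bb:cc:00:00:02"}
mac_table = {
    "aa:bb:cc:00:00:01": ("leaf-sw1", "Ethernet1/1"),
    "aa:bb:cc:00:00:02": ("leaf-sw2", "Ethernet1/7"),
}

def build_edges(cmdb, arp, mac_table):
    """Connect app -> VM -> switch/port into one edge list."""
    edges = []
    for row in cmdb:
        mac = arp.get(row["ip"])
        sw_port = mac_table.get(mac) if mac else None
        if sw_port:
            edges.append({"app": row["app"], "vm": row["vm"],
                          "switch": sw_port[0], "port": sw_port[1]})
    return edges

print(build_edges(cmdb, arp, mac_table))
```

In a real integration the two input tables would come from the ServiceNow and IP Fabric APIs, and the resulting edges would be written into NetBox.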
Vitezslav Savel: Martin, listening to you is very interesting to me. I've been speaking with large Fortune 500 companies for the past few years, and it seems this issue—operating in silos and having data in multiple places—they've always had it. They always knew it's a problem, and they were always figuring out workarounds. Or it just wasn't being tackled proactively. So I hope now, as AIOps starts picking up, people will finally have a reason to move away from this and really figure out a way to standardize everything.
Martin M.: Yeah, definitely. I agree with you. Silos are a big issue. When you have siloed data and you don't have the overall picture, you cannot achieve anything because you'll have blind spots. You’ll have gaps in the logic—in the correlations that AIOps needs to do.
You have to have some level of standardization, some level of visibility. And the better you standardize, the better visibility you have—from on-prem infrastructure up to the cloud—the better you can do with implementing AIOps.
On the next slide—what is next? When you have the dependency map, what is actually the next step? I mentioned multiple times something like automated remediation. For that, you go from the mapping into the action. For us, the chosen platform was Ansible.
You have an event. You correlate that event to an incident. Then you can make data-driven, smart decisions and trigger event-driven automation to remediate. With that, you’re reducing the toil as much as you can. And you can keep adding to that. Also, you have context for the automation that can resolve things for you. You're moving from reactive to automated—or proactive—eventually.
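The event-to-remediation flow above can be sketched as two small steps: use the dependency map to compute the blast radius of a failed component, then pick the playbook that matches the target. The dependency structure and playbook names are illustrative assumptions, not Red Hat's actual Ansible setup.

```python
# Hedged sketch of "smart automation with context": given an event and
# a dependency map, compute the blast radius and pick a playbook.
# The map structure and playbook names are illustrative assumptions.

deps = {  # downstream dependencies: component -> things that rely on it
    "leaf-sw1": ["vm-101", "vm-102"],
    "vm-101": ["billing-app"],
    "vm-102": [],
}

playbooks = {"switch": "remediate_switch.yml", "vm": "restart_vm.yml"}

def blast_radius(component, deps):
    """Everything downstream of a failed component (depth-first walk)."""
    seen, stack = set(), [component]
    while stack:
        for child in deps.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def pick_playbook(event):
    """Map an event target to the right remediation playbook."""
    kind = "switch" if event["target"].startswith("leaf-") else "vm"
    return playbooks[kind]

event = {"type": "interface_down", "target": "leaf-sw1"}
print(sorted(blast_radius(event["target"], deps)))
print(pick_playbook(event))
```

In practice the blast-radius result would also drive notification (who to tell) and the playbook would be launched by an event-driven Ansible rulebook rather than called directly.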
So that’s the next step. I would summarize it from my side with key takeaways. What you need to do: reframe your goal. Aim for intelligent automation rather than some mythical, all-knowing AI black box that you plug in and everything is instantly solved. AIOps has a lift, that’s for sure—but it is achievable.
You have to start with the foundation and get it right. Building complex automation and AI platforms requires understanding what you're connecting. You have to start with the dependency map. You have to understand what you need to provide as context.
Then, for the data you already have—find your glue. For us it was IP Fabric. Find the glue that can do it for you. When you have that glue, and you have the dependency map, start building on top of it. Automate with context. Then go proactive.
Having automation with context, having automated remediation—that removes toil for you. And with that, you can invest more time in going proactive, going smarter, and using new tools. Nowadays, with the boom of LLMs and AI agents, it’s even easier than before.
Vitezslav S.: Martin, with implementing this in your infrastructure and your teams starting to use it—of course you save time and make things more efficient—but is this unlocking some new interesting processes or things that you weren't doing before?
Martin M.: Certainly. It identifies data quality gaps that we need to address with standardization, as we mentioned. What it opens on the process side: being proactive. Breaking down silos. Identifying circular dependencies—where you have a chicken-egg problem—and you’re creating infrastructure that’s impossible to recover from a failure.
All of that can be done with code and further analysis. Those are the new things it unlocks for us. There are many more—small things here and there—that it unlocks. It’s eye-opening when you see something like this in action, when you see context that you couldn’t see before.
Thank you for that question. Let me open up this slide. Vitek, it’s your time I guess.
Vitezslav S.: It’s my time. What you said as the last thought is a nice intro to what we're doing because it's really about being proactive. And Martin, we've been working together with you and with Red Hat for the last three years. For us, IP Fabric was always used as a baseline for automation.
The reason I'm saying this is that most of the people listening right now are probably just thinking about AIOps or maybe just doing the initial research. But IP Fabric is very relevant even if you're not there yet—maybe if you're just thinking to automate, or if you're thinking about observability. All of these new trends have in common that you really need a solid, standardized dataset to make them work.
You have multiple options to approach this. You can start writing your scripts on your own or just collecting very specific data, or get a tool that will do it for you and really accelerate your development. Because you don't need to think about this part—you can immediately get to a point where you have the data and can start working with it.
So, what does this look like? You get one tool which gets information from all of your islands, from all of your silos. I have my AWS here, I have my Azure, my F5 load balancers, my VMware NSX-T, my firewalls. All of this is ready for you out of the box to start working with.
What Martin was talking about with the information about the hosts, about the ARP tables—this is not just a list of devices, but hundreds and hundreds of pre-collected tables that are very relevant to give context to any of the advanced initiatives you may have.
Martin was talking about ARP tables, which I have here. But for those that are not so advanced, usually the first step in automation—or probably the first step in leveraging AI—would be: “Tell me what problems I have.” To understand this, first you need the data, and then you need to put the logic in.
Sometimes, you can even just use IP Fabric. We were just working with a customer who, after running the first discovery and going to this table, noticed, “Why do I have v1 here? We just failed an audit—everything should be v3.” So if you get a use case like this, you can use IP Fabric directly to make your intent check.
Tell IP Fabric: “My intent is, if version is v1, make it red—or in this case, make it green if it's v3 because this is what I want—and make everything else red.” So I can just easily use a tool like this to quickly, proactively validate my network.
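The intent-check rule Vitek describes boils down to a simple classification: each row is green if it matches the intended value and red otherwise. This sketch uses generic column names and sample data as assumptions; in IP Fabric the rule would be configured on a table rather than written in code.

```python
# Sketch of an intent-verification rule like the version example above:
# mark each row green when it matches intent (v3) and red otherwise.
# Column names, values, and data are illustrative assumptions.

rows = [
    {"hostname": "core-r1", "version": "v3"},
    {"hostname": "edge-r2", "version": "v1"},   # the audit failure
    {"hostname": "edge-r3", "version": "v3"},
]

def apply_intent(rows, column, expected):
    """Tag each row with a color based on whether it meets intent."""
    return [dict(row, color="green" if row[column] == expected else "red")
            for row in rows]

checked = apply_intent(rows, "version", "v3")
violations = [r["hostname"] for r in checked if r["color"] == "red"]
print(violations)  # hosts that break the intent
```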
This saves me a lot of time because now I can action all of these insights. I can feed them to my AI or automation logic, and I can get results based on this. It really accelerates all of these projects.
While we're at this, we can also do topology maps—which I don't want to spend too much time on today—but if this is relevant, please do reach out to us. We have an online demo available.
Another thing that's very important for context and for understanding the dependencies in your infrastructure: if you have a project where you know you'll be adding a new application, or you need to troubleshoot, you need to understand how the infrastructure actually communicates and how the path looks. Based on all the data we collect, we can show you.
I can say, “Okay, this is my IP A, this is my second IP.” I can be more specific. From the perspective of getting context, I immediately see: this is where the device is connected. It's being blocked by this firewall. And because IP Fabric really is an in-depth data platform, we also have historical views.
So I can go back in time. You see that now I'm looking at the network after the change. And I can say, “How did this look before the change?” Now I can dig deeper, I can evaluate, I can overlay, and immediately see that I don't need to open that firewall because someone actually changed the route here.
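Comparing two snapshots, as in the route-change example above, is essentially a diff of the same table at two points in time. This is a minimal sketch with an assumed table shape (prefix to next-hop), not IP Fabric's actual snapshot-comparison output.

```python
# Sketch of comparing two snapshots: diff the route table before and
# after a change to see what was added, removed, or re-pointed.
# The table shape (prefix -> next hop) is an illustrative assumption.

before = {"10.0.1.0/24": "via 192.0.2.1", "10.0.2.0/24": "via 192.0.2.1"}
after  = {"10.0.1.0/24": "via 192.0.2.9", "10.0.3.0/24": "via 192.0.2.1"}

def diff_routes(before, after):
    """Classify every prefix as added, removed, or changed."""
    return {
        "added":   sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(p for p in before
                          if p in after and before[p] != after[p]),
    }

print(diff_routes(before, after))
```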
All of these things are very relevant to complement the standard reactive tooling that throws alerts at you. But it's very hard to get to the analytical part. And this analytical part is what will be required by all the upcoming AI initiatives to actually give you the value.
Martin M.: Vitek, now that you've shown the intent-based checks and snapshot comparison—I spoke about event-driven automation. So can you fire an event from IP Fabric? So it says, “Hey, this failed,” and you can take an action based on it?
Vitezslav S.: Absolutely, you can. We do support webhooks and there are multiple ways to approach this.
The way we think about IP Fabric is that we want everyone to have this one tool to go to and look at information. But at the same time, we respect the fact that specialized teams have their own dashboards. It can be Grafana, it can be Zabbix, it can be SolarWinds—that they’ve already fine-tuned and have workflows in.
Everything I'm showing can be gathered through API—even this end-to-end. I can get the API documentation right here. Every check. For instance, we have customers who use IP Fabric to validate segmentation. I can say, “If I go to post-change here, I see it's being blocked.” I can say, “This is correct. This should always fail.”
Now, after IP Fabric does a discovery and sees that someone made a bypass around your firewall—just like we've seen in this use case—you will know about it.
I'm conscious of the time because I could keep talking about IP Fabric for a long, long time. We've barely scratched the surface. But I think really the key takeaway is: don’t think about how you can collect data. There are a lot of tools that can already do this for you. They can integrate to your ServiceNow, to your NetBox, or you can just use them on their own. This will save you many months of development.
Stephen C.: Okay. We've got plenty of time here for questions. I would encourage people, if you have any questions, to enter them now in the Q&A tab. I have a couple that I'd like to ask, and we had one that did come in through the chat. But preferred vehicle is to put them in through the Q&A.
Just to start off here—Martin, okay, so you talked about what you've done to date. What's next for Red Hat in terms of IP Fabric data? What do you have planned?
Martin M.: For us, as Vitek was showing, the next step is intent-based checking and plugging those checks in as signals and evidence.
I mentioned that AIOps is also about making or grouping the individual signals from the infrastructure and applications and everything. So getting those signals in as well is one of the next steps for us.
It's also about getting more out of the API and filling in more data into the dependency map. So when the AI is enriching events that are already there, it has more context of what is running where—not only how things are connected, but what device it is, what version, everything like that.
Stephen C.: Another question is kind of an obvious one actually. If you didn’t have IP Fabric at your disposal, where would that leave you? What would you have to do otherwise in terms of achieving the levels of automation you're trying to get to here?
Martin M.: Good question. We would have to go look for any other tool like that—or develop something on our own—because you still need this level of insight and, well, I will keep calling it the glue for the data that you have from other sources. It was just wiser for us to go with IP Fabric than spending a huge amount of time developing something custom.
Stephen C.: Do you have any idea how long that would take you? How big a project that would be?
Martin M.: Well, depending on the size of the team, but it would certainly take ages. With the variability of platforms that we have, and maintaining everything for each platform, then code changes—you have to change something… yeah, that—I don’t want that.
Vitezslav S.: Martin, I'm still sometimes using what you told me a long time ago when we were initially talking together. You told me how you were planning to increase what you're automating. You said that many of your tools have native APIs, but you don't have this for your legacy infrastructure, and you told me that you use IP Fabric as an API endpoint for your legacy network. I really like that—it's something that stuck with me.
Martin M.: Yeah, that’s a good quote, to be honest, because we don’t have that legacy network anymore. As you said, we've used IP Fabric for quite a long time, but we still use the API endpoint. With the SRE way of working and AIOps nowadays, it's just nice when you can hit an API and get the data you're looking for.
Stephen C.: Okay, we had a question that came in earlier: How can IP Fabric discover telco-provided provider edges for MPLS?
Vitezslav S.: I'm going to just talk about how IP Fabric discovers in general. We’re just a little bit tight on time, so this is more of a sneak peek—but this is what makes us unique. We don't use any SNMP, we don't use IP scanning or anything like that. It's really based on best practices—what a CCIE engineer would do if you asked them, "Please map my network."
So we use SSH for the legacy devices and API for devices that support it. We usually only need a few starting points—connect to device number one, see its CDP neighbors, see routing relationships, look into the ARP table. If the MAC looks like a Cisco router, we connect there as well. We don’t blindly go everywhere—it’s very efficient and very accurate. I’d be happy to talk about this later with you.
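The neighbor-driven discovery Vitek describes can be sketched as a breadth-first crawl: start from a seed, read its neighbor relationships, and only visit devices that are actually reachable that way instead of scanning IP ranges. The neighbor table here is an illustrative assumption; real discovery also inspects routing and ARP data, as described above.

```python
# Sketch of neighbor-driven discovery: start from a seed device and
# walk CDP/LLDP-style neighbor relationships instead of IP scanning.
# The neighbor table below is an illustrative assumption.
from collections import deque

neighbors = {
    "seed-r1":  ["core-sw1", "core-sw2"],
    "core-sw1": ["leaf-sw1", "seed-r1"],
    "core-sw2": ["leaf-sw2"],
    "leaf-sw1": [],
    "leaf-sw2": [],
}

def discover(seed, neighbors):
    """Breadth-first crawl: visit each device exactly once."""
    seen, queue = {seed}, deque([seed])
    while queue:
        for peer in neighbors.get(queue.popleft(), []):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen

print(sorted(discover("seed-r1", neighbors)))
```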
Stephen C.: Yeah, I guess the takeaway is: you do exactly what an expert CCIE would do in terms of probing the network.
Vitezslav S.: Exactly.
Stephen C.: Another question came in—I'll read it verbatim: Is there a common sweet spot you see for enterprise customers taking snapshots of their network to best help assure intended state? In other words, snapshots every two hours, every four hours?
Vitezslav S.: I have an answer, but Martin, I’m curious—what’s your approach? How often do you do snapshots?
Martin M.: Let me try to remember from the top of my head. I think currently we have a full snapshot every four hours. And we do on-demand snapshots when we need them in between.
Vitezslav S.: In general, we can discover, let’s say, 2000 switches, routers, firewalls within 20 minutes. So it’s very fast. I’d say best practice is two—maximum three—times a day for a very large snapshot. That’s usually enough to compare state and track changes. Then of course you can do smaller snapshots before and after a change. All of this can be triggered through API as well.
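Triggering a snapshot from a scheduler or CI job, as Vitek mentions, is an ordinary HTTP call. This sketch only builds the request; the endpoint path, header name, and payload are assumptions for illustration, not the documented IP Fabric API, so consult the vendor docs for the real calls.

```python
# Hedged sketch of an API-triggered snapshot. The endpoint path,
# header name, and payload below are assumptions, not the documented
# IP Fabric API.
import json

def snapshot_request(base_url, token, note):
    """Build the HTTP request a scheduler or CI pipeline would send."""
    return {
        "method": "POST",
        "url": f"{base_url}/snapshots",           # hypothetical path
        "headers": {"X-API-Token": token,         # hypothetical header
                    "Content-Type": "application/json"},
        "body": json.dumps({"note": note}),
    }

req = snapshot_request("https://ipfabric.example.com/api", "TOKEN", "pre-change")
print(req["url"])
```

A pre-change/post-change pair of these calls is what enables the snapshot comparison shown earlier in the demo.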
Stephen C.: All right, we just had a question pop in here via the chat. This is directed at you, Martin, and I’m going to read it verbatim: How many devices do you have? How many interfaces, how many vendors, how old is your network infrastructure, how many hosts?
Martin M.: Quite specific. I’ll try to answer without going into the knowledge base or the DCM that we have. Number of network devices—I think it goes into a couple thousand. In addition to that, access points and other network appliances as well. Multiply that by the default number of interfaces—I don’t know from the top of my head how many interfaces exactly.
When it comes to number of vendors, we aim to be multi-vendor. We have multiple—two to five—depending on what you count in.
How old is the infrastructure? As old as we can stay while still being compliant. Everything is under support, so basically the usual lifecycle.
Vitezslav S.: By the way, this lifecycle piece was one of the aspects Martin was testing in IP Fabric back before you were the manager, right Martin? It was some time ago. We do have end-of-life information out of the box as well. I know that was something you were impressed by—it saved time.
Stephen C.: Vitek I have a question for you. Could you elaborate a little more on the specific types of dependency mappings that you maintain? Not a full list, but more on the types—so people can get a feel for what they are.
Vitezslav S.: Yeah, I would say the dependency map is one step above what IP Fabric gives you. IP Fabric gives you hundreds of standardized, parsed tables. So if you have a question like: how is my AAA configured? How does my .1x look? How is BGP configured? What is my global route table? You have one place to go. And it doesn’t matter if you have one vendor in your infrastructure or five—it’s all standardized and ready to use.
With dependency maps, we do know about all the hosts, how they’re connected. We can show you if the path is redundant. This is right there in IP Fabric. Sometimes it makes sense to use integrations with tools like NetBox or ServiceNow and use IP Fabric as a baseline—then enrich it with other sources that we can’t get directly from CLI.
Stephen C.: Okay, we got another question in the chat. Interesting one, Martin: What functionality do you see as a gap in IP Fabric? Or maybe asked another way—what would you like to see added to the product?
Martin M.: That’s a good question. And as you said, I wouldn’t call it a gap—more something we’d like to see going forward. We’re actually planning to submit some feature requests.
As I mentioned, we use MAC address and ARP tables to collect a lot of data, but we’re planning to go even further with LLDP and possibly XDP protocols. We’d like some flexibility with the custom fields in LLDP—so IP Fabric could parse those.
That’s definitely one of the feature requests we’ll be submitting. The basic information is already there, and I think it’s sufficient for most companies. But for us, we’d just like that cherry on top: more flexibility.
And nowadays, what I’d like to see is definitely MCP server support, with AI agents becoming a thing. We are talking about AIOps, but now there's also a new term: AgentOps. It would definitely be helpful if we could get an MCP server natively supported by IP Fabric, something you can connect to and use to provide context for your LLMs.
Vitezslav S.: Martin, maybe from this perspective, this is something we’re thinking about. We're still evaluating the best way to add support for AI into IPF because we don't want to be just another vendor with an improved chatbot calling it “AI.” The GUI in IP Fabric is already very straightforward and easy to get data from, but something like what Martin described is definitely what we're considering for the future of the tool.
Stephen C.: All right, I think this is going to be a wrap. We're heading up about 40 minutes after the hour here. I'd like to conclude, but before we do that, I believe we're going to put some information in the chat for folks who want to follow up and learn more about Red Hat and about IP Fabric.
Also, folks at IP Fabric will be following up with all the attendees. So if you have any other specific questions—technical questions, business-related questions—by all means, feel free to ask those.
In conclusion, I'd like again to thank Red Hat, thank IP Fabric, and specifically thank Martin and Vitek for taking the time with us today. And thank you all for tuning in—those of you who joined us live and those who will catch this via the replay.
I think that's it, and I will call that a wrap. We look forward to having everyone back here for our next ONUG webinar somewhere down the road. Thank you.