Join Daren Fulwell and Dan Kelcher as they go up into the attic and dig out some creepy relics of their earlier careers - occasions when they would grapple with ghosts, search for spooks, and vanquish vampires! Find out how IP Fabric could have helped them turn these horror stories into fairy tales!
Transcript
Hello, and welcome to today's spooky Community Fabric episode. And welcome to Dan, the most recent addition to our solution architect team. Hey, Dan. Enjoying your first few months with IP Fabric? Oh, it's a blast.
So much fun. Getting to talk to customers and going out and doing events and all that. So good times. Excellent.
That's all I like to hear. Now I've asked Dan to join us not because he's scary, although you could argue, but because he and I were chatting about some of the networking horror stories that we've seen over the years, and we thought this would be a great opportunity to share some of those and perhaps take a look at how network assurance could have helped us out. Now, if you're watching and you've got any questions or comments, just drop them into the LinkedIn chat as we go, and we'll do our best to answer them. So, Dan, your tale of woe, where does it start? So it was a dark and stormy night.
Figure that's how all horror stories need to start. It's gotta start that way. And in actuality, it was, like, midday and wasn't too bad outside, but, to set the stage a little bit: I was working as a network consultant, so going out to customer environments, and the project that I was working on was to do a core switch replacement for a customer.
So at the outset, we're kind of looking at the existing environment, looking at what we're looking to implement, building out the plan of action. And as kind of a passing comment, they'd said that they've been having these just weird network issues. It's what every network engineer loves to hear. Weird issue, occurs randomly. They don't really know what's going on, but it just seems everything goes down for a little bit.
And then before they can figure out what's happening, everything's back up again. They've tried looking at LAN links. They've tried looking at bandwidth. They've tried looking at event logs, and they haven't been able to find anything. They don't really know what it is.
It's just kind of a gremlin in the system. Well, okay. Yeah. Good to know. We'll keep that in the back pocket, but troubleshooting random network issues is not really in scope for a switch replacement project.
So I start digging into the configs and looking at things, and I realized that there's a couple of lines of config that I would have expected to see on a core switch that I didn't, namely setting a root switch for spanning tree. And okay. This is interesting. So we started to dig in, like, the joy of networking. So where is it?
Well, do you guys have documentation? No. No documentation. Okay. Let's let's talk about what the network is.
So I worked with the client, and they started whiteboarding it out. So the core switch was in one building, and then they had layer 2 links out to other buildings. They had IDFs in all the different buildings, and they had the same VLANs spanning all the different buildings, all the different closets. Everybody loves stretched layer 2. It's like that.
Okay. Let's see where we can figure this out. So, the good old "the hip bone's connected to the...": logged into the core switch. Where is our root bridge? Okay.
It's out this interface. Okay. Where's that interface go? So we're logging in switch to switch to switch. We finally get to a device where it's it's connected here, but we can't log in to it.
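The hop-by-hop hunt described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the switch names and the ROOT_PORT_NEIGHBOR table are hypothetical stand-ins for what you would read off each switch by hand (the neighbor reached through its root port), and the trace simply stops at the first device it has no data for, much like the switch nobody could log in to.

```python
# Hypothetical data: for each switch we CAN log in to, the neighbor
# reached through its spanning tree root port. The last device is
# deliberately missing, because nobody could log in to it.
ROOT_PORT_NEIGHBOR = {
    "core-sw": "bldg2-mdf",
    "bldg2-mdf": "bldg5-mdf",
    "bldg5-mdf": "bldg5-idf3",
    "bldg5-idf3": "unknown-desk-switch",
}


def trace_to_root(start, root_port_neighbor):
    """Follow root ports hop by hop; return the path walked from start."""
    path = [start]
    seen = {start}
    current = start
    # Stop when the current device has no known root-port neighbor:
    # either it is the root, or we have no data for it.
    while root_port_neighbor.get(current) is not None:
        current = root_port_neighbor[current]
        if current in seen:
            raise ValueError(f"loop detected at {current}")
        seen.add(current)
        path.append(current)
    return path


path = trace_to_root("core-sw", ROOT_PORT_NEIGHBOR)
print(" -> ".join(path))
```

With a table like this already collected, the walk is instant; the expensive part in the story was building the table one login at a time.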
They have no idea what this device is. They have no idea what's going on, but it's in another building. So they take a note of it. We update the configs, or get a change in to update the configs, on the core device. And, eventually, the client's network engineer goes out, traces that patch cable, and finds that there was an archaic switch that was sitting underneath someone's desk.
And every time that that switch would get bumped, the power cable would get bumped like, hey. We're gonna we're gonna clean, so we're gonna move stuff around. Anytime that switch got knocked offline, spanning tree had to reconverge and sort out the entire network. And because it's the core network and then across a layer 2 link to a site, across another layer 2 link to another site, it was at that far end, and then it was MDF to IDF from the IDF to this desk. So you're going all of these hops to get to the center of the network and then propagating out.
So, yeah, every time that switch got jostled, the entire network had to reconverge, which knocked everything off for 2 and a half minutes or so. And it was, yeah, it was one of those, like, yeah, this absolutely explains it, like the Scooby-Doo pulling the mask off. Just like, oh, archaic switch. It was you all along.
But, yeah, they were struggling with these issues for years, and they couldn't find it. That's a shadow IT job, though, surely. It's the guy with the mask. Yeah. It was basically that shadow IT had installed the ghost switch. Right.
Like, somebody at some point was like, hey. We need another port here. Let's just grab this old switch and plug it in. And more than likely, it had been there for years, and nobody knew it was there. Nobody knew to look for it. Nobody could log into it.
So, yeah, it had probably been there since the dawn of time. And because it was archaic, its MAC address won, and it became the root bridge. And I guess, yeah, not being able to associate the outage with someone kicking the power lead every now and then. Yep.
Yeah. There was there was because the person that was at the desk, like, if they were moving something, like, they might not have been working or doing anything, so they wouldn't have have figured it out because it was layers out. And by the time that, hey. We're having a problem, people start logging in to look. All of a sudden, things are working again.
So they didn't even know where to start looking. And even with the various management tools that they had, there was no indication that it was a reconvergence. So I guess once you find that, how did you go about resolving it? I mean, the fix was easy. It was just setting the bridge priority on the core switch.
Done. Now our core will automatically take over. But the other thing that was big with that was not knowing, because we couldn't get into the device. I guess we could've looked at the spanning tree settings as far as the root election. But just making sure that there wasn't a priority that was set on that device, or resetting it.
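As a rough illustration of why the old switch won and why one priority setting fixed it: in spanning tree the bridge with the lowest bridge ID becomes root, and the bridge ID compares priority first, then MAC address. The sketch below assumes every switch sits at the standard default priority of 32768; the switch names and MAC addresses are made up for the example, and the per-VLAN extended system ID is ignored for simplicity.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int  # lower wins
    mac: str       # tie-breaker: lower wins


def bridge_id(bridge):
    """Bridge ID as a comparable tuple: priority first, then MAC as an integer."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges):
    """The bridge with the numerically lowest bridge ID becomes root."""
    return min(bridges, key=bridge_id)


# Hypothetical switches, all left at the default priority.
before = [
    Bridge("core-sw", 32768, "00:1b:54:aa:00:01"),
    Bridge("idf-sw", 32768, "00:1b:54:bb:00:02"),
    Bridge("desk-sw", 32768, "00:00:0c:00:00:01"),  # old kit, low MAC
]
root_before = elect_root(before)
print(root_before.name)  # desk-sw

# The one-line fix: give the core a lower priority than everyone else.
after = [b._replace(priority=4096) if b.name == "core-sw" else b for b in before]
root_after = elect_root(after)
print(root_after.name)  # core-sw
```

With equal priorities the election falls through to the MAC comparison, which is how a forgotten switch under a desk can end up as root for the whole campus.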
And I think they ended up, once they knew the device was there in the first place, like, we're gonna just pull this entirely. No. It's gonna be nice. It's, yeah, if they need the extra port, they can get another drop run or they can solve that problem other ways, but random, managed switches that aren't managed by IT are problematic. Yeah.
But Yeah. And and, again, I suppose that's the question of knowing what's there. Right? That's making sure you're fully aware of everything that's in the network that is documented and so on and so forth. Exactly.
Knowing that it's there, the documentation... Very few people document layer 2 and spanning tree information anyway. So not having a topology diagram, not having that stuff to quickly find meant we had to spend, I don't know, half an hour to an hour just stepping through: log in to a device. Okay. What's the adjacent one? And then because I was a consultant, I didn't know the environment.
It's, here's the device name. Where is this device topologically? Where are we going? We're tracing this through so that we can understand what we were working with. And, yeah, just having no idea where the root was, what we were connecting to, what was going on. It was, yeah, a perfect storm.
And the amount of time that that must have taken, the amount of effort and concentration of people's minds and whatever else, that was a very expensive fix. Well, and it's, yeah, for what effectively was, you know, one line of config for an STP priority, a really, really easy fix. It took us probably all in maybe an hour or 2 to track down and figure out how we wanted to resolve. But the bigger thing was, yeah, 2 minutes of outage every couple weeks, every couple months. Right.
For years. Like, it was because that layer 2 network spanned the entire, I don't know, 7 buildings, 8 buildings, you're knocking out thousands of people for a couple minutes. It was a nuisance problem, but it was never something that they were able to track down. So, yeah, years of nuisance. And I guess, I mean, some would argue, of course, these days looking at it that you would never design a network that way or build it that way.
And they'd have a fair point, but obviously, you know, times were different. Right? Well, even now, there's still plenty of spanned layer 2 networks. Like, this is the ongoing debate where, sure, it adds some convenience, and there's some arguments that people will make for stretching layer 2 and having those large networks. There's a whole lot of counterarguments to that, namely spanning tree is problematic and then the way that broadcast... Like, there's a lot that needs to be considered there, but it's also the joy of just network design where... Yeah.
There are some valid situations where you might need to do stretched layer 2. So for as much as, as a network engineer, you might wanna go, no, this is a horrible idea, we should never do it, it's still gonna happen. There's still gonna be some considerations where it does, and then we need to be able to understand it.
We need to be able to support it. And, again, there's just environments where that's the way it's set up, and you're not gonna be able to fix it all at once. You need to be able to work with it until you can get to that ideal state. I think you touched on something important there, that network design more generally, it's about pragmatism. Right?
It's about being able to give the user the experience they need to have, using the technologies in the best way you can. And what's happened in that one is, yes, the user has had a requirement to have this VLAN everywhere, and that's all fine, so long as you build it within the correct constraints. And, obviously, that's the bit that broke down right there. So... Exactly. Just go on then.
I'm gonna I'm gonna ask you the question. Here's the plug. Right? How would we have been able to help with that? Oh, so this is this is the fun part.
This is the part that, when I first kind of learned of IP Fabric, this is what makes IP Fabric awesome. So coming from a consulting background, oftentimes, you get dropped into environments with very little information. So the first thing is just the dynamic topology diagrams. Everybody likes to be able to just look at something. The way that you wrap your head around it is visual; you know, a picture's worth a thousand words.
So right off the bat, having just a topology diagram of the way that the sites were laid out, how things were interconnected, all of that stuff, that would have been extremely useful. Secondly, in IP Fabric, we can actually display per VLAN where your spanning tree configs are, where's your root, what are your blocking ports, and all that. So within, you know, minutes of, hey, let's figure this out, we could have seen the entire spanning tree topology, where things were, where the root was, and really kind of understood what that environment looked like. From a discovery standpoint, even if we couldn't log in to that switch, we could have potentially seen, hey.
There's a CDP neighbor here that we can't log in to, so the IT department would have known that that switch was there. And then from a kind of a table information standpoint, we collect spanning tree stability metrics. So before, the only reason why we found this issue was because a question was asked: well, why isn't a priority set on this switch? That seems odd. This wasn't a, oh, I'm looking for it. I stumbled across this completely accidentally.
So having that information initially: if they could have seen that the switch was there in the first place, they could have solved this long ago. Had they known that there were spanning tree stability issues, they could have resolved it. They would have known where to look. They would have had the right information.
It would have been solved long ago. Or absolute worst case, even if nobody was looking at that, once we got to that point, like I said, hit the diagrams and find that information within minutes instead of... yeah. 3 minutes instead of 45 plus. So, number 1, it could have avoided the issue entirely, or shortened the time to resolve. And, generally, I mean, this is just a spanning tree issue, but that same kind of set of rules applies.
Getting the information about an environment, number 1, you can identify problems before they're problems, or in the event that there is something that's happening, you can figure out where much quicker. You've got all the information that you need. You can drill through to find that root cause without having to connect into a device and then the next device and then the next device. Yeah. Yeah.
Yeah. Run a half dozen show commands and piece things together. So, yeah, like I said, if not avoiding it entirely, it definitely would have shortened the resolution time. No. That's cool.
I mean, that's the beauty of it. Right? You've got the snapshots there that give you the view over time of all of that data. And you touched on this before, the idea of having state data from earlier points in time as well: being able to track change in state from one point to the next means that you've got an additional layer of information that you would not normally have even with documentation. So, yeah, some interesting points there.
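To make the snapshot idea concrete, here is a minimal Python sketch of comparing state between two points in time. It is not how IP Fabric does it; the snapshot contents (a mapping of VLAN to the root bridge seen at collection time) and all names are hypothetical.

```python
def diff_snapshots(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical snapshots: which switch was root for each VLAN.
monday = {"VLAN10": "core-sw", "VLAN20": "core-sw"}
tuesday = {"VLAN10": "core-sw", "VLAN20": "desk-sw", "VLAN30": "core-sw"}

changes = diff_snapshots(monday, tuesday)
for vlan, (before, after) in changes.items():
    print(f"{vlan}: root was {before}, now {after}")
```

A root bridge that moves between snapshots without a planned change is exactly the kind of signal that was missing for years in the first story.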
I'm gonna talk about one of mine. And actually, interestingly, mine has a bit of a layer 2 flavor as well, but subtly different. And this was a real horror show. If the customer who I had this problem with is listening, they will know instantly what we're talking about, even, probably, by now. I was working as an engineer with a customer who deployed OTV between their data centers.
Now OTV is a Cisco overlay technology, right? And what's its purpose? It's to stretch layer 2 between those data centers, but to do it in a way that doesn't involve spanning tree. So it takes that particular complexity away in theory. Now they had to deal with other complexity.
With another layer of different complexity, of course. Right? Because it has its own control plane and an underlying network. But the intention here was they had 3 data centers that they wanted to be able to do vMotion between. Right?
So there's the reason: they wanted this fast failover for their applications, and to be able to have this run in a VMware environment. And after a few teething troubles, actually, it worked really well. It worked smoothly. Servers in one subnet happily talked between DCs, and they were all fine. But then one dark stormy night, all that changed, of course. In the dead of night, we were working on some failover testing.
First taking out individual routers in the OTV, and then we got to dropping circuits between some of the locations just to make sure that everything was doing as it should. Because OTV runs over the top of pretty much any underlay network, we should have been able to just drop circuits wherever, and it should have rerouted via other locations and just done its thing. And so it looked like it was doing that. Pings were fine. We were browsing websites on the same VLAN from one place to another.
All looked good. Right? So we signed it off. So I thought, that sounds fine. All working as it should do.
We did notice one weird little thing: our RADIUS authentication, when we were logging into switches, was timing out. It's a bit odd. But the customer was using this creaking old ACS server, and we basically assumed it didn't like the failover very much. So testing was, you know, they put a tick in the box, everything was fine, and we moved on. One of the circuits was being relocated over the following days, which is why we were doing this failover testing.
So they could monkey around with this backup circuit without impacting the live network. Fantastic. So everything's fine. We returned the following morning, and we should have known that something was wrong, really. There were these dark thunder clouds over the office and lightning and all the rest of it.
And as we opened the doors... Everywhere else was sunny. It was just... Right. Just as I was saying. Yeah. Just always this rain coming up.
And as we walked through the doors, there was like a deluge of complaints and problems as we came through the steps into the office. Basically, nothing was working, or the claim was nothing was working. Everything was down. No one could do anything. It's the same as you.
It's like, what does that even mean? No idea. So, of course, the the first reaction was, well, what do we change and what what needs fixing? Just gets to the the really big scary part. Now it's like the this is the cliffhanger moment.
I know. Right? And and then the phone rings. So, anyway, where were we? Right.
No. Everything was down. No one could do anything. It was a nightmare. So our first reaction obviously was to walk in, restore the circuit that was down because we thought, oh, that'll fix it.
But it turned out it wasn't an option. They'd already started the work to move it. So we were a bit stuffed there. So we needed to work out why connectivity wasn't as it should have been. And as we said, ping was fine and HTTP was fine, but anything Active Directory seemed to be a bit shaky.
Didn't make any sense. OTV was working. Turning it off and on again? Well, of course, we tried turning it off and on again, as with all of the things. Right? But, of course, that didn't get anyone anywhere.
Yeah. OTV seemed to be fine. Things were talking IS-IS, which is used as the routing protocol to exchange MAC address information across the edge, and it all seemed to be doing what it should. All the information was there. The tunnels were up.
Everything seemed okay, but it was only this authentication that seemed to be the problem. So first thing, well, the third thing, I suppose, after turning everything off and on again and doing the stuff: get to the whiteboard and start drawing up what it is. Because, of course, this is all happening on an environment that's documented how it should be, but not how it is, because there's a whole bunch of other stuff there and the old documentation's out of date. So we're drawing it up on the whiteboard in order to track down exactly what it is that's going on, and to build a path and an understanding of all the bits of the network. And we're checking every port, every interface, every configuration, and everything looks fine. No problem there. We couldn't work it out; that thing of it only being related to authentication was still troubling us until we put a Wireshark on the network.
Right? So, ah, well, what we'll do is we'll sniff what's going on on that AD authentication and see what's happening. So we put a Wireshark trace onto the ACS box. And what we saw was we could see the RADIUS stuff coming to the box. No problem.
But when it was supposed to go off and connect to the AD in the back end, it wasn't working. We weren't seeing anything on the trace at all. Now we could see the TCP and we could see the ICMP, but we couldn't see any UDP coming off the thing. Because it's Kerberos. Right? This is using Kerberos as its authentication, which by default runs over UDP.
This is odd. What's going on here? Anyway, go to the other end of things and look to see the ACS end of things. So this was at the Active Directory. We go to the other side of things and have a look.
And we see that these requests are going out from the ACS box, but they're really, really big packets. They're full. Right? For whatever reason, it's padding the packets out to their full size. So these are being dumped somewhere.
So there's something going on here. And, of course, alarm bells start, which were sort of faint to start with and are now getting louder and louder, because you're saying, okay. I've got a packet of a certain size that's getting through, my ICMP. That's all good. I'm getting HTTP getting through, but that's TCP.
So we're actually negotiating our TCP windows and our packet sizing, but UDP, which is full packet, is failing. You can see where it's going. Right? All of a sudden, alarm bells, MTU, MTU, screaming across the floor. Everyone knows what the problem is now, but now we've gotta try and find it.
Right? Back to the diagram on the whiteboard and logging into every device step by step. Is it this interface? No. It's not this interface.
Is it that interface? No. It's not that interface. Working our way all the way through the network. And so we track it down to, of course, the last interface we tried, where the MTU is neither set for jumbo frames nor increased in any shape or form.
It's there at its default, which isn't big enough to take a full packet plus the OTV overlay headers. Found the problem. Fantastic. Soon as we change that MTU, bang, everything's fixed, restored, working. And, of course, then it takes 2 hours for the tumult to die down and normality to be restored.
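The arithmetic behind that failure can be sketched as follows. This is an illustration only: the 42-byte overlay overhead is an assumption (it is the figure commonly quoted for OTV encapsulation), and the sketch assumes the encapsulated packet cannot be fragmented, so anything that does not fit is dropped.

```python
# Assumption: bytes of header added by the overlay encapsulation.
# 42 is the figure commonly quoted for OTV; treat it as a parameter.
OVERLAY_OVERHEAD_BYTES = 42


def fits_on_link(inner_packet_bytes, link_mtu, overhead=OVERLAY_OVERHEAD_BYTES):
    """True if the packet plus overlay headers fits the link MTU unfragmented."""
    return inner_packet_bytes + overhead <= link_mtu


# A small ping (84 bytes) versus a full-size 1500-byte packet,
# over a default 1500-byte link and a jumbo-frame link.
for inner in (84, 1500):
    for mtu in (1500, 9216):
        verdict = "passes" if fits_on_link(inner, mtu) else "DROPPED"
        print(f"{inner:>5}-byte packet over MTU {mtu}: {verdict}")
```

This is why small pings looked healthy while the full-size UDP packets vanished: only traffic that filled the packet hit the limit on the one link left at its default.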
Right? But, again, you see, now you've touched on a number of reasons and a number of ways that IP Fabric would have helped you there. Right? All of those reasons. Having good documentation.
Right? Knowing exactly where everything is, we wouldn't have needed to draw the the stuff up on the whiteboard. We wouldn't have needed to work out what those flows were because we could have done path lookups, right, and understood end to end which devices were impacting which interfaces. Having intent validation, being able to turn around and say, actually track that the MTUs at either end of every single link are consistent. Right?
IP Fabric has that in the dashboard sitting there waiting for you. Right? So every time you run a snapshot, it's checking that for you. And, of course, you've got that ability from snapshot to snapshot to see what's changed and what goes on. It just so happened that a number of weeks later, said customer was swapping out a device somewhere in the network that had failed, and guess what happened?
They put that device back in and didn't set the MTU and therefore had, let's just say, a similar problem. Yeah. I know. Right? Because we all learn from experience apart from when we don't.
I've, yeah, the number of times where I've rebooted a device and, oh, I forgot to save the config on it. You'd figure I'd learn by now, but, yeah. I think everybody needs a backstop. Right? Some sort of thing to fall back on. And, yeah, on that occasion, we didn't have it either.
So but, yeah. No. I mean, this is the point, right? I suppose that we get to when you've got that ability to track what things are doing day to day, when you've got that documentation that you always know is gonna be up to date, then you can build these rules that actually check things for you. So you've got standards that are met, so you know that your spanning tree root is defined and where it's defined, or you know what your MTU sizes should be when you need them set for an overlay. Then you've got that information available at all times and it's always checked for you, double checked, and triple checked.
And I think the fun thing here is really, in both scenarios, for whatever the issue is, the resolution is basically a single line. Like, the fix isn't complicated. The bigger problem, the more time, the more headache is trying to identify the problem and where it is than the actual fix. The fix in both scenarios, like I said, one line. Like, super easy to do, but it's, yeah, hours or, in some cases, years of not being able to find it.
Yeah. That's the big part. It's the needle in the haystack, isn't it? I mean, this is the problem. Every network is so complex.
There's so much going on that in order to make sure that you can find a problem when it does occur, you need that consistent level of information and documentation at every single part of the network. Otherwise, you're just never gonna know, are you? And the more that you can automate the actual process of doing the checking and the validation for you, the easier it gets then to cut to the chase really. I suppose it's like you said, it took you however many hours to resolve your problem. Same for me.
And that's not just the amount of time it takes to solve the problem. It's the amount of time of the outage that it's caused over time. So in your case, you know, 2 and a half minutes, however often that cable was dislodged. In my case, 3 hours of a busy trading floor in an organization that really doesn't need to be down. So, yeah, costly, costly mistakes to make.
So just because I'm gonna throw out the softball question for you. Yeah. So speaking of MTU, one of the fun things that MTU brings in: different platforms, different vendors. Okay, wait, is jumbo frames 9,000 bytes?
Is it 9216? Like, everybody's kind of got their own settings for that. So IP Fabric, if I've got 2 different devices that have a different value for what jumbo frames might be, how do you deal with that? I suppose the beauty of all of these things is, right, that IP Fabric and its data model is normalized. Right?
So we go through a process of normalizing the data on the interfaces, and we have all of that available to you. So really the trick, I suppose, is knowing what it should be, from a normalized standpoint, in order to deliver the outcome, and then creating those intent rules. That's the key. In fact, there's an intent rule, there's actually one built into the platform, that tracks that things are consistent either end of a link.
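A consistency check of that kind can be sketched generically in Python. This is not IP Fabric's intent rule or its API, just the idea: given normalized MTU values for both ends of each link, flag any link where the two ends disagree. The device names and values are hypothetical.

```python
from typing import NamedTuple


class Link(NamedTuple):
    a_device: str
    a_mtu: int  # normalized MTU on the A end
    b_device: str
    b_mtu: int  # normalized MTU on the B end


def mismatched(links):
    """Return the links whose two ends report different MTU values."""
    return [link for link in links if link.a_mtu != link.b_mtu]


# Hypothetical inventory: one clean link, one 9216-vs-9000 jumbo
# disagreement, and one end left at the 1500-byte default.
LINKS = [
    Link("dc1-edge", 9216, "dc1-core", 9216),
    Link("dc1-core", 9216, "wan-rtr", 9000),
    Link("wan-rtr", 1500, "dc2-core", 9216),
]

bad = mismatched(LINKS)
for link in bad:
    print(f"{link.a_device} ({link.a_mtu}) <-> {link.b_device} ({link.b_mtu})")
```

Note that the check needs no agreed "correct" value; it only asks whether both ends of a link match, which is enough to catch a replacement device that went in with the default MTU.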
So you don't even need to put a value on it if you don't want to, but it makes sense to: if you've got a standard set across the organization, create that intent rule to validate that as well. So that's really the best answer from an IP Fabric point of view. And again, for any other thing that's similar, like this, like spanning tree. I mean, just knowing that you've got a spanning tree priority set on every device is going to be advantageous. And because we can then track and find out, because they're set, because we know what they are, where it is that they are, we can indicate those in the diagrams.
Right? So, yeah, I mean, that's the key. Know your network, but you can only know it when you've got the data. And that's always the gap, certainly in my experience. You don't just go through and spend all day updating Excel spreadsheets and Visio diagrams. I was gonna say Visio diagrams.
Right? That was certainly a huge part of my life. I must admit, and this is an admission: I haven't touched a copy of Visio in two and a half years, but there you go. That says a lot, I think.
The, and you always had the layer 1, the layer 2, the layer 3 Visio diagram. Obviously, you had every site, every switch. Here's here's my my OSPF. Here's my BGP information. Yep.
And making sure all your lines are just right and color coded. Exactly. I mean, that sounds like a nightmare to me. Just keeping track of that lot. That could be its own nightmare episode. Yeah.
I'm making that serious IT documentation. Bum, bum, bum. This is a really that's a really good point, actually. What we could do and and if you're still with us, hope you are. If you've got any networking horror stories, maybe this is something we could pick up, Dan.
Right? Share them with us and let's perhaps see if we can talk you through what IP Fabric can do to help you, because ultimately that's what the product's there for. It's to give you time and give you back some sanity, I suppose, and some way of protecting you from these horrors. Dan, I'm gonna thank you for joining me. Thank you for sharing your story.
It's awesome. Really good to see how even experienced network engineers like yourself can save time and, I suppose, money for the business, right, by using the right approach and the right tooling. So thank you for your time. Thanks for having the conversation. It's a whole bunch of fun to rehash all the horror stories.
Lessons learned. There are plenty more where those came from. Right? Oh, there's a wealth. Maybe for the next episode, though.
Nice one. Thanks for your time, mate, and thank you everybody for listening.