In our last blog we talked about “The Scary 40%”, the percentage of issues raised by customers yet invisible to the networking team. This is a rather sobering statistic showing how little we really know about our infrastructure and the services that run on top of it. We argued that this is the reason why many IT organizations today are stuck in a reactive mode of operation. Lack of visibility, and in particular lack of the right kind of visibility, keeps the networking team in perpetual defense. So it is logical to expect that the networking team spends a lot of its time solving problems that customers have already reported and escalated.
The same EMA study we referenced in our previous post found that a networking team spends 37% of its time troubleshooting issues. Let us not forget that the network team gets pulled into virtually all IT-related issues experienced by an organization because, on one hand, the default answer to an issue is “It is the fault of the network” and, on the other, the networking teams are the master troubleshooters (maybe because they spend so much time doing it). To add to this rather dreadful reality, many of the services we run over our networks today do not originate on our networks. Many more people and organizations are involved in service delivery than in the past, and that changes both the troubleshooting process and the tools being used. We also create many overlays, which make it more difficult to see at which layer the problem is experienced.
None of these changes in the IT environment, however, justifies having network engineers burned out on pager duty or stuck troubleshooting one critical issue after another. It is all about working smarter, not harder:
- Reduce the number of inbound calls – Instrument the infrastructure to collect data that is meaningful not just to the network team but also to the apps and services teams. Let them have the data that allows them to decide whether they really need to involve the network team or just the data center team
- Stress the truly important stuff and prioritize issues based on user experience, not on network metrics – Nowadays we build very reliable, highly redundant networks. If we did a good job, all that investment should mean that we do not have to stress over an interface-down notification or a router reload. The question to ask is “Are the customers affected or did my network find another path?” By looking at the higher-level metric of User Experience (UX) we act and prioritize around the service, not the pipes
- Do system-wide troubleshooting, not your traditional hop-by-hop – Instead of hopping from one device to another to make sense of an issue, get tools that can show you in one view what is going on across the entire environment. In my experience, a network engineer who knows the environment can look at such an on-demand report and immediately know what the problem is or, very likely, where it might be
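
To make the second point concrete, here is a minimal sketch of what ranking issues by user impact rather than by device severity could look like. Everything in it is hypothetical: the `Incident` fields, the `prioritize` helper and the sample numbers are invented for illustration and do not come from the EMA study or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    affected_users: int   # hypothetical UX metric: users with degraded sessions
    device_severity: int  # traditional device severity, 1 (low) to 5 (critical)


def prioritize(incidents):
    """Order incidents by user impact first; device severity only breaks ties."""
    return sorted(
        incidents,
        key=lambda i: (i.affected_users, i.device_severity),
        reverse=True,
    )


# Made-up sample data: two "scary" device events that the redundant
# network absorbed, and one low-severity event that users actually feel.
incidents = [
    Incident("core router reload, traffic rerouted", 0, 5),
    Incident("slow checkout page", 1200, 2),
    Incident("interface down on redundant link", 0, 4),
]

ranked = [i.name for i in prioritize(incidents)]
for position, name in enumerate(ranked, start=1):
    print(position, name)
```

With this ordering the slow checkout page lands at the top of the queue even though its device severity is the lowest of the three, which is the behavior the bullet argues for: the router reload that the network routed around can wait.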
If you are a network engineer reading this, don’t worry: you will not lose your relevance in the organization through these optimizations. All you will do is get your life back and get a chance to hone your skills in the brave new world of Cloud, SDN, NFV and IoT, while keeping your boss happy. So drop the menial aspects of the dreadful 37% and focus on the interesting problems while spending more time on IT transformation (or coaching a kids’ soccer team).