Inexcusable Data Center Mistake

Posted on Feb 1, 2008 | 7 Comments

I’m going to go off on a bit of a (somewhat grumpy) lecture here in hopes that people will stop long enough to listen. A little Gestalt therapy, if you will. Ultimately I hope at least one person recognizes a need and acts on it.

If I had a dime for every time I have personally seen this one issue bite someone in the backside, I’d be a rich man. There are a zillion things that can go wrong on a mission-critical network, but of those things there are actually just a few that account for a substantial portion of the issues that typically bring critical services down.

So, if you run a network and have not addressed the one issue I will describe below, please take the time out of your day to start a plan to remediate the problem ASAP. Along the same lines, if you are not sure where you stand with regard to the issue, or if you have never checked but you feel confident because everything works today and always has so it can’t possibly be an issue… Again, please just take the time to inspect your infrastructure and put a plan in place.

I should also say that if I had a dime for every time I’ve said exactly what you just read in the paragraph above, I’d be a rich man. I lost count long, long ago of the number of hours spent watching people try to avoid – in any way possible – checking the obvious and addressing it. Usually that’s due to those egg-on-face concerns that go along with being the guy who missed something so simple and critical (albeit not too obvious) when it came time to learn the detailed intricacies of running a high-availability network.

Okay, enough with the harshness. Time for the issue at hand.

The number one network mistake I have seen people make on IP networks, over and over again, is using the default settings on their switches and servers that cause the network interfaces to auto-negotiate the speed and duplex settings.

Seriously, if your requirement is to provide high availability and your SLAs require your services to be up, do not neglect the critical (but often skipped) process of manually configuring your NICs and switches to the proper setting, and make sure both ends of every link are set the same way. Just because the interface says it’s running at 100 Mbps and full-duplex doesn’t mean it’s working, and when your network takes a dive and you start losing packets you’ll be sorry.
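If you want a starting point for the inspection, here is a minimal sketch of an audit script. It assumes a Linux server, where the kernel reports each interface's current speed and duplex under `/sys/class/net/<interface>/`. The function names and the interface name `eth0` are made up for illustration, and the script only reads settings, it does not change them. Remember that a clean report here only tells you what the interface claims, not that the link is healthy.

```python
from pathlib import Path


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) as reported by the Linux kernel.

    Either value is None if the kernel cannot report it (for example,
    when the link is down or the interface does not exist).
    """
    base = Path(sysfs_root) / iface
    try:
        speed = int((base / "speed").read_text().strip())
    except (OSError, ValueError):
        speed = None
    try:
        duplex = (base / "duplex").read_text().strip()
    except OSError:
        duplex = None
    return speed, duplex


def check_link(iface, want_speed, want_duplex, sysfs_root="/sys/class/net"):
    """Return a list of human-readable mismatches (empty if none)."""
    speed, duplex = read_link_settings(iface, sysfs_root)
    problems = []
    if speed != want_speed:
        problems.append(f"{iface}: speed is {speed}, expected {want_speed}")
    if duplex != want_duplex:
        problems.append(f"{iface}: duplex is {duplex}, expected {want_duplex}")
    return problems
```

Calling `check_link("eth0", 100, "full")` on each server and comparing the result against what the switch port is configured for is one way to build the inventory for your remediation plan.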

Along the same lines, never assume that one half of one percent of packet loss is no big deal. Seriously, if you are seeing retransmits on your network interfaces, something is likely wrong. Also, chances are that 0.5% loss is not being scattered evenly across your traffic. It may all be happening at once in bursts, and that hurts – a lot.
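To see why an average hides the pain, here is a small sketch using made-up numbers (not measurements from any real network): 200 one-second intervals of 1,000 packets each, where every lost packet falls inside a single interval. The function name `loss_summary` and the sample data are purely illustrative.

```python
def loss_summary(samples):
    """Summarize loss from (packets_sent, packets_lost) pairs, one per interval.

    Returns (overall_loss, worst_interval_loss) as fractions between 0 and 1.
    """
    total_sent = sum(sent for sent, _ in samples)
    total_lost = sum(lost for _, lost in samples)
    overall = total_lost / total_sent if total_sent else 0.0
    # Skip intervals with no traffic to avoid dividing by zero.
    worst = max((lost / sent for sent, lost in samples if sent), default=0.0)
    return overall, worst


# Hypothetical data: 199 clean intervals, then one where everything is lost.
samples = [(1000, 0)] * 199 + [(1000, 1000)]
overall, worst = loss_summary(samples)
print(f"overall loss: {overall:.1%}, worst interval: {worst:.1%}")
```

With this data the overall figure comes out to the "harmless" half a percent, while the worst interval is a full second of total outage, which is what your applications actually experience.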

Again, if I had a dime for every time I (or someone working with me) recommended inspecting the interface settings, recommended changing them, and flagged interfaces where traffic analysis showed data transmission loss that was obviously causing network apps to fail… Well, let’s just say it’s amazing how hard it is to convince some people that their network is the cause of the issue.

Why am I being so blatantly blunt about this? Because I hope that the message will carry, that administrator egos will be set aside, and that people will understand that the real-world evidence based on years of actual experience, proven over and over again, bears out the fact that this will eventually happen to you if you have not already taken the steps to ensure it doesn’t. Don’t let that happen. Protect that ego now, rather than waiting for it to be damaged.

Finally, don’t fall prey to the idea that just because you have high-grade HP, IBM and Dell servers and Cisco switches, the money you (smartly) spent negates the need to set things up the right way, or that these vendors have everything figured out for you and set as defaults. Point of fact, this issue occurs just as often (if not even more so) with your expensive, data-center-class hardware. In fact, Cisco switches have been somewhat famous for requiring intervention of the manual-configuration type. Cisco even has a troubleshooting support article that you can refer to for your configuration needs.

You have been advised. Now go do something about it. And forward this to every network administrator you know. The network (and ego) you save may be theirs. :)

  • English speaker

    Remediate? REMEDIATE?? The word is ‘remedy’. Don’t invent your own.

  • http://www.matthartley.com Matt Hartley

    English speaker: Actually, Greg is spot on.

    Remediate: Noun. “The act or process of correcting a fault or deficiency.”

    and

    “set straight or right”; “remedy these deficiencies”; “rectify the inequities in salaries”; “repair an oversight”

    http://www.thefreedictionary.com/remediate

  • http://www.linuxslacker.com John H

    Greg, you’re exactly right. I applaud you for taking the time to explain the whys and wherefores of proper network topology setup. As an addendum, one more reason to set the proper speed/duplex in your network environment is that other devices that manage servers as target devices, i.e. KVM and Serial over IP appliances, require specific network parameters to function as intended. I work for a company that manufactures these appliances and am now the technical trainer for this corporation (Avocent). One of the main reasons why our appliances (and so many other devices that talk on the network) provide “unexpected results” is directly related to network administrators’ and engineers’ laziness about, or inept understanding of, this seemingly simple (yet definitely often overlooked) concept. Thanks again for the article.

    Regards,
    John

  • zookster

    Thanks for the information. This is news to me, and I’ve been working in I.T. for over 15 years, mostly in small businesses.

    If some switches are un-managed and cannot have their speed and duplex set manually, does that negate the value of setting those values on the servers and desktops?

    Is there a way to set the speed and duplex on Windows XP desktops using a script or GPO?

  • mike

    Ok,
    tell me exactly how to do it and I will. My network consists of 2 wired 10/100 and 2 unwired 54. Now exactly how do I do what you told me to do
    the hub is a billion adsl wireless/wired router ?

  • zenium tech

    You’re absolutely correct that every network admin should do this simple action. I once worked for a ‘technical’ guru who first insisted on getting cheap (and I mean CHEAP) non-managed switches that caused nothing but auto-negotiate trouble – no need for those ‘fancy’ settings.

    And at the one client that had a managed Cisco switch, he insisted nothing was wrong with the network and that it was a Microsoft problem when we were experiencing high packet loss. I did not buy the argument and continued to look for a network problem. Once I found the suggestion to manually set the speed and mode on the server and the switch, the problems went away.

    And even the people who should know better don’t follow this simple advice. Several months ago I read a story about a VoIP trial at Argonne labs. The network manager had unexplained problems and complained about the vendor. Turns out the vendor DID mention manually setting the NIC speed and parameters; the network manager just ignored the advice.

    So yes, it is possible to be a rich person by just repeating this advice to all the ‘experienced’ network managers.

  • Craig

    The only problem with the above advice is that auto negotiation is mandated for gigabit connections.