The Top Ten Ways To Keep Your Data Center On-line
August 1995, Power Quality Assurance Magazine
Most data center outages are not caused by some highly visible, catastrophic failure that takes days or weeks to repair. Most are caused by fairly straightforward facility problems that could have been prevented had some mitigating steps been taken ahead of time. After years of talking to other data center professionals, engineers and contractors associated with data centers, I have come up with a list of ten fairly simple steps that data center facility managers can take to help keep their data centers on-line.
One example of a preventable failure is a data center that crashed when the diesels ran out of fuel. The diesels had run for years in quarterly tests, always while utility power was available. But when there was an actual power outage, the operators discovered that the pumps for the underground diesel storage tanks were on utility power. The diesels ran fine until the day tanks ran out of fuel. The following top ten commonsense suggestions could help prevent this type of facility outage from occurring.
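The fuel pump failure above is a hidden dependency: something the generators needed was fed only by the utility. A minimal Python sketch of auditing for that kind of dependency follows. The component names, the `feeds` and `needs` tables, and the function `utility_only_dependencies` are all hypothetical, made up for illustration; a real audit would be driven by the facility's own one-line diagrams.

```python
# Hypothetical inventory: which source feeds each component, and what each
# component needs in order to keep running. All names are illustrative.
feeds = {
    "generator": "self",
    "day_tank": "none",              # passive tank, needs no power itself
    "storage_tank_pump": "utility",  # the hidden weakness in the example
    "transfer_switch": "generator",
}
needs = {
    "generator": ["day_tank"],
    "day_tank": ["storage_tank_pump"],
    "storage_tank_pump": [],
    "transfer_switch": ["generator"],
}


def utility_only_dependencies(root, feeds, needs):
    """Return everything `root` depends on, directly or indirectly,
    that is fed only by utility power."""
    found, seen, stack = [], set(), [root]
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        if item != root and feeds.get(item) == "utility":
            found.append(item)
        stack.extend(needs.get(item, []))
    return sorted(found)


print(utility_only_dependencies("generator", feeds, needs))
```

Anything the function returns is a component that will stop when the utility does, even though the generators themselves start and run.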
10. Don't Place Convenience Outlets on UPS Power.
You never know what will be plugged into a convenience outlet. I know of two data centers knocked off-line because of problems with equipment plugged into convenience outlets. One had a vacuum cleaner plugged into a UPS-powered outlet and the other a drill. Both times, there was a short in the tool and a breaker tripped on ground fault. UPS power should be for critical equipment only.
9. Don't Put Lighting on UPS Power.
I have been guilty of this one myself. Changing a ballast in a light fixture while the circuit is hot is a fairly common, although not recommended, procedure. However, if you cause a fault in a light fixture powered by a UPS system, you can trip a breaker on ground fault. Again, UPS power should be for critical equipment only.
8. Cover All EPO Switches With Plastic Covers.
Every data center has at least one horror story about an employee who accidentally or purposely EPOed the data center. I know of one data center where it happened twice in a month, once by a new employee and once by a security guard. In both instances, lack of training contributed to the outages. Make activating an EPO at least a two-step process.
7. Don't Install an EPO System With Normally Closed Contacts.
While the EPO system will work with either normally open or normally closed contacts, operationally, normally closed contacts are a nightmare. First, a loss of power causes the breakers to trip. Second, if you have to add an EPO button, it's very difficult to accomplish without a shutdown. Third, if there is a break in the EPO circuit, the breakers will shunt trip. A key to reliability in a data center is maintainability. Normally closed contacts on an EPO system are not maintainable.
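The difference between the two contact arrangements can be sketched as a small truth function. This is a deliberately simplified model, not a wiring specification: the function name `breakers_trip` and its parameters are hypothetical, and it assumes a single loop in which normally closed contacts must stay energized to hold the breakers in, while normally open contacts trip the breakers only when a button closes the loop.

```python
def breakers_trip(contacts, button_pressed, loop_intact=True, control_power=True):
    """Simplified model of an EPO loop driving breaker trips.

    contacts: "NC" (normally closed) or "NO" (normally open).
    """
    if contacts == "NC":
        # The loop must stay energized; anything that opens it trips the breakers.
        return button_pressed or not loop_intact or not control_power
    if contacts == "NO":
        # The breakers trip only when a button closes an intact, powered loop.
        return button_pressed and loop_intact and control_power
    raise ValueError("contacts must be 'NC' or 'NO'")


# Nobody pressed the button, but a wire in the loop was opened during maintenance.
print(breakers_trip("NC", button_pressed=False, loop_intact=False))  # True
print(breakers_trip("NO", button_pressed=False, loop_intact=False))  # False
```

Under these assumptions, opening the normally closed loop to splice in a new button is indistinguishable from pressing one, which is why the work cannot be done without a shutdown.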
6. Never Leave an Inexperienced Person Alone in the Data Center.
I have heard of vendors working in a halon-protected area who lit a propane torch without notifying anybody or shutting down the halon system. I also know of a vendor who shut down a PDU in order to add a breaker to it. The PDU was feeding a mainframe! Do not assume everybody knows what they can and cannot do in a data center.
5. Set Up the Chiller Plant to Fail Running.
A data center I work with had a pneumatic line fail, which resulted in a shutdown of the data center. The chiller plant was set up to shut down in the event of a control failure. It should have been set up so that if the controls failed, the chiller plant kept running.
4. If You Have Water-Cooled Mainframes, Place One of the Circulating Pumps on UPS Power.
A cooling outage of 2 to 3 minutes could cause a water-cooled mainframe to thermal out. By placing a circulating pump on UPS power, you might extend that time to 10 to 15 minutes. This extra time might allow you to get the chiller plant back on-line without affecting the data center.
3. Regularly Test Your UPS System.
Problems with a UPS system do not often show up in a steady-state situation. They more often occur during a transition from utility to battery power or from batteries to generator. A failure during a test is much easier to deal with than one during a critical time.
2. Monitor the Condition of the Batteries.
The failure of a single battery can result in the loss of the critical load. If you are doing impedance testing and a maintenance-free battery is giving erratic readings or has post seal leaks, replace it. The cost of replacing a battery is insignificant compared to the cost of an outage. If a flooded cell is showing signs of deterioration, have someone familiar with large battery systems review it and, if necessary, replace it. Again, it's much better to spend a couple of thousand dollars testing and replacing one or two batteries than to have a failure that causes the data center to crash.
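One way to spot an erratic impedance reading is to compare each battery's latest reading with its own earlier readings. The Python sketch below is an illustration only: the function `batteries_to_review`, the battery IDs, the sample readings (taken here to be in milliohms) and the 25 percent threshold are all assumptions, not recommended values. Use the limits supplied by your battery manufacturer or testing contractor.

```python
from statistics import mean


def batteries_to_review(history, max_change=0.25):
    """Return IDs of batteries whose latest reading differs from the mean
    of their earlier readings by more than max_change (a fraction)."""
    flagged = []
    for battery_id, readings in history.items():
        if len(readings) < 2:
            continue  # no earlier reading to compare against
        earlier = mean(readings[:-1])
        change = abs(readings[-1] - earlier) / earlier
        if change > max_change:
            flagged.append(battery_id)
    return sorted(flagged)


# Hypothetical impedance history, oldest reading first.
history = {
    "string1-cell07": [3.1, 3.0, 3.2, 4.4],
    "string1-cell08": [3.0, 3.1, 3.0, 3.1],
}
print(batteries_to_review(history))
```

A flagged battery is a candidate for closer inspection or replacement, not proof of failure.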
1. Perform a Full System Test of Your Emergency Power System.
Too many facilities test only a portion of the system. Testing the UPS by itself, or the generators unloaded, or just the batteries, doesn't ensure that the system as a whole is functioning properly. Drop power to the facility and make sure that the generators start and will carry the load, that the transfer switch is functioning properly, that the UPS/generator integration works, and that the batteries will carry the load until the generators are on-line.
I know of two data centers that were unwilling to perform full system tests, although they regularly tested pieces of the system. When they had an extended utility outage, they both crashed. One crashed because of the failure of a diesel fuel pump on the day tank for the diesel. The other crashed because the float voltage on the NiCd batteries they used on their diesels was not high enough to keep the batteries charged without the added voltage from the charger. You cannot rely on your emergency power system working as designed if you do not test it.
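The checks named above can be kept as a simple list, so that after each test it is obvious what was never exercised. This Python sketch is only a bookkeeping aid under stated assumptions: the names `FULL_TEST_CHECKS` and `missing_checks` are hypothetical, and the list contains only the checks mentioned in this article, not a complete test procedure.

```python
# Checks taken from the full system test described in this article.
FULL_TEST_CHECKS = [
    "generators start",
    "generators carry the load",
    "transfer switch operates",
    "UPS/generator integration works",
    "batteries carry the load until generators are on-line",
]


def missing_checks(observed):
    """Return the checks not observed during a test, in list order."""
    return [check for check in FULL_TEST_CHECKS if check not in observed]


# Hypothetical partial test: generators run unloaded, batteries tested alone.
partial = {
    "generators start",
    "batteries carry the load until generators are on-line",
}
for check in missing_checks(partial):
    print("not verified:", check)
```

An empty result only means every listed check was observed. It says nothing about items the list does not name, such as the fuel pumps in the examples above.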
These steps are obviously not the only ones you should take to improve the reliability of your data center. A good place to start is to commission a reliability study. This study should include electrical and mechanical design reviews, as well as a review of maintenance and testing procedures. Lack of proper maintenance and testing may be the single biggest reason for facility failures in data centers. Once the weaknesses in your data center are identified, correct them. Too many reports end up on the shelf until there is a problem. Remember, reliability in a data center is a process. Each improvement you make is part of that process.