Crash
Two cautionary tales show that minor details can lead to major data center problems
Every facility executive responsible for data centers can tell at least one horror story. Some come from direct personal experience; others are data center legends. Who hasn't heard the story of the hapless IT professional who leaned a box against an emergency-power-off switch and powered down the data center, or the curious newcomer who wanted to know what the little red button on the front of the power-distribution unit did? Stories like these show how hard it is to keep a data center from failing. Every data center is unique; every design is a custom solution shaped by the experience of the engineer and the facility executive.
Data center failures can be rooted in several sources: design, construction, maintenance, quality of materials and equipment, commissioning and direct human intervention. For the most part, data centers, even ones that fail, benefit from sound design practice and intent, professional construction oversight, and high-quality craftsmanship, and they are maintained according to data center quality guidelines. But a single overlooked mistake can quickly escalate into something much larger, a power or air conditioning failure that brings down the data center.
A good example comes from the colocation business, which is made up of real estate companies that offer tenants space not in office buildings but in data centers. The occupants are servers, not people. The data center real estate company brands its services on a promise of non-stop climate control and power reliability. A single moment without cooling or power harms not only the tenant, which stands to lose revenue from downtime and recovery time, but also the colocation company's business model. Because customers who lease colocation space are not necessarily part of the design, construction, commissioning and maintenance process, data center facility executives take on the extra responsibility of ensuring that the buildings run correctly.
One data center real estate company that maintains more than a million square feet of colocation space nationwide recently lost a data center as a result of a construction error that exposed a design miscalculation and a commissioning flaw. Cabling between the generators and the paralleling gear had been damaged during construction: while being pulled through the conduits, the cable insulation had been nicked and scraped. The damage was not enough to be detected by normal meggering, a test of insulation resistance, but it was enough to create a weak link in the mission-critical power chain. Eventually, the cable insulation failed.
If everything is built and programmed correctly, the loss of a single cable should not be an issue. The design engineer had foreseen the potential for a generator system failure and had specified paralleling gear with a programmable logic controller (PLC) programmed to handle this kind of fault. When the fault actually occurred, however, the PLC responded by shutting down the entire generator bank rather than isolating the faulted feeder, and once the failure began to cascade, the PLC was unable to intervene.
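What selective fault isolation is supposed to look like can be sketched in a few lines of code. The sketch below is purely illustrative, with hypothetical breaker names and a simplified topology; an actual paralleling-gear PLC would be programmed in ladder logic or IEC 61131-3 structured text, not Python.

```python
# Illustrative sketch of selective fault isolation in paralleling-gear logic.
# Breaker names, topology and responses are hypothetical, for illustration only.

def respond_to_fault(breakers, fault_location):
    """Trip only the faulted feeder; keep healthy generators paralleled."""
    if fault_location in breakers["feeders"]:
        breakers["feeders"][fault_location] = "OPEN"   # isolate the damaged cable
        return "feeder isolated, generator bus intact"
    if fault_location == "paralleling_bus":
        # Only a fault on the bus itself justifies shedding the whole generator bank.
        for name in breakers["generators"]:
            breakers["generators"][name] = "OPEN"
        return "generator bank shut down"
    return "no action"

breakers = {
    "generators": {"G1": "CLOSED", "G2": "CLOSED", "G3": "CLOSED"},
    "feeders": {"F1": "CLOSED", "F2": "CLOSED"},
}
print(respond_to_fault(breakers, "F1"))  # feeder isolated, generator bus intact
```

In the failure described here, the controller in effect treated a feeder-level fault as a bus-level event, so the healthy generators were shut down along with the faulted circuit.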
When the shutdown was complete and the paralleling switchgear was cold, the entire site transferred to battery power. Within the 15-minute design runtime, the batteries were depleted and every customer lost service to its computers. The data center had failed, and the colocation company's branding promise had been seriously compromised.
Why did this happen? Was it a construction error? A commissioning oversight? Could it be pinned on the owner's design manager, who had devised the paralleling scheme in the first place? What about the engineering design team?
There were multiple causes for the failure. In this instance, a construction craftsmanship issue revealed a design shortfall.
Diagnosing the Problem
In hindsight, it is clear that even more rigorous testing before commissioning was needed. The failure also showed that the PLC had not been programmed correctly to clear this fault condition and that the fault scenario had never been exercised during commissioning, so the error went undetected. Finally, the sequence should have been part of the preventive maintenance program, a change that was made following the disaster.
The design and commissioning teams had not anticipated this exact failure sequence. The project would have benefited from earlier involvement, during the design phase, of a commissioning agent with specific experience in PLC programming. A third-party reviewer with relevant design and operating experience would also have added value if brought into the design process.
Every data center is one of a kind. The better the commissioning team can simulate real-life scenarios, the more reliable the data center will be.
If the data center just described went down with hardly a whimper, another data center crashed with a literal bang.
In a multistory, high-profile government data center, a busduct-panelboard connection exploded, effectively shutting off power to approximately 15,000 square feet of the most critical computing in the facility.
In this incident, the design relied on an isolated redundant uninterruptible power supply (UPS) backup. If a UPS system failed, a static automatic transfer switch was to shift to the already-operating isolated redundant UPS and transfer the load within a quarter cycle. The system worked well, and the client was satisfied with the transfer scheme and the rotary UPS concept.
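To put the quarter-cycle figure in perspective, a bit of back-of-the-envelope arithmetic shows why such a transfer is invisible to the load. The 60 Hz frequency is standard U.S. utility power; the power-supply hold-up time used below is a typical value assumed for illustration, not a figure from this project.

```python
# Rough arithmetic behind the quarter-cycle transfer claim.
cycle_ms = 1000 / 60        # one full cycle at 60 Hz is about 16.7 ms
transfer_ms = cycle_ms / 4  # a quarter-cycle transfer is roughly 4.2 ms
holdup_ms = 16              # assumed hold-up time of a typical server power supply
print(f"transfer window: {transfer_ms:.1f} ms")
print(f"ride-through margin: {holdup_ms - transfer_ms:.1f} ms")
```

Because the interruption is far shorter than the time a server power supply can typically ride through on its internal capacitance, the computing load never notices the transfer.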
Source of the Problem
Where this system failed was downstream of the automatic transfer switches. Each switch fed a single busduct riser, which terminated directly in a main distribution panel on each floor of the facility, one busduct per panel. A single fault on any busduct or main distribution panel could therefore compromise the critical load.
As it happened, the electrical connection between the busduct and the distribution panelboard failed and the load was lost. A single point of failure succeeded in bringing down the floor. Not until the facility's electricians ran jumper cables from one of the intact risers and back-fed the main distribution panel did the floor have power again.
Why did this failure occur? The building had been designed in tight coordination between the government representative and the designer; the entire system had been commissioned and had been running with tight oversight for more than two years. What happened?
The cause of the problem was the failure of a manufactured busduct connector, one of hundreds in the building. The connector joined lengths of feeder busduct via a sliding piece, designed to move approximately one-quarter of an inch to make installation easier, and a break-away torque bolt designed to keep the installer from over-torquing the connection.
Although the investigation team was not asked to explain exactly why the joint exploded, it determined that the quarter-inch of play designed into the connector had allowed a section of bare copper bus to be left exposed to the atmosphere. The team surmised that the combination of airborne dust, humidity and possibly other contaminants led to an arc that developed into a fault and exploded.
During the analysis, the investigation team isolated each busduct riser from the static automatic transfer switch at the source and from the main distribution panel at the termination. During the megger test, the electrical forensic team discovered two additional joints that did not pass, clearly candidates for future failure. Not only did those joints fail the megger test, both arced visibly and audibly as the test voltage was ramped up. The joints had shown themselves to be the weak link in the system; the installed busduct technology was vulnerable to catastrophic failure.
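The screening logic behind a megger test is straightforward: apply a DC test voltage and compare the measured insulation resistance against a minimum acceptance value. A minimal sketch of that pass/fail check follows; the one-megohm threshold and the readings are illustrative assumptions, not values from this investigation.

```python
# Simplified pass/fail screen for insulation-resistance (megger) readings.
# The threshold and sample readings are hypothetical, for illustration only.

MIN_MEGOHMS = 1.0  # common rule-of-thumb minimum; actual acceptance criteria vary

readings = {                     # joint ID -> measured insulation resistance (megohms)
    "riser_A_joint_07": 2200.0,  # healthy joint
    "riser_B_joint_03": 0.4,     # weak insulation: would fail the screen
    "riser_B_joint_11": 0.1,     # would fail, and arced during the voltage ramp
}

for joint, megohms in readings.items():
    status = "PASS" if megohms >= MIN_MEGOHMS else "FAIL - investigate"
    print(f"{joint}: {megohms} megohms -> {status}")
```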
The incident emphasized several lessons that might seem like common knowledge but nevertheless slipped past all parties in the complex process of designing and constructing the data center.
The first is to eliminate single points of failure. Had there been dual paths to the critical load and either static switch power-distribution units or rack-mounted static switches, there would have been no data center failure.
The second lesson is to use conduit and wire in lieu of busduct. Every electrical connection is a potential point of failure. The feeder busway system installed here had mechanical connections every 12 feet; conduit and wire have connections only at the source and at the load, as the quick count below illustrates.
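For a sense of scale, a rough count of connection points for a hypothetical riser makes the point; the 120-foot run length is an assumed figure, not a dimension from this building.

```python
# Rough count of potential failure points: busway vs. conduit and wire.
# The 120-foot run length is an assumption chosen for illustration.
run_length_ft = 120
section_length_ft = 12                      # busway sections joined every 12 feet
sections = run_length_ft // section_length_ft
busway_connections = (sections - 1) + 2     # joints between sections plus two end terminations
wire_connections = 2                        # conduit and wire: source and load only
print(f"busway: {busway_connections} connection points")
print(f"conduit and wire: {wire_connections} connection points")
```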
Lesson number three is to use only data-center-grade equipment in data centers. The installed busway proved inherently vulnerable to installation error: one connection failed in service, and two more failing connections were uncovered during testing.
Unfortunately, data center professionals do not necessarily have the chance to test-drive a facility before it is fully operational. At the end of the day, every data center is unique, and professionals must take all of the right steps to anticipate future mishaps and learn the lessons of previous experiences.
Five Elements of a Reliable Data Center
Building and designing a data center is a complicated process. The complexity is compounded not only by the building type, but by the fact that each data center is unique, built and designed to meet specific criteria. A successful project depends upon five things:
- Good design with input from the facility executive, builder, designer and commissioning agent
- Good construction, including careful selection of construction firms and subcontractors, as well as effective construction administration and documentation of field issues
- Specification and installation of quality data-center-grade materials
- Effective commissioning
- Thoughtful operational practices and timely maintenance
—Daren Shumate
TECHNOLOGY UPDATE
Limiting the Damage From Single Points of Failure
Developments in IT technology are making it possible to achieve an important goal in data center design: to move the most common single point of failure as close to the critical equipment as possible. Doing so limits the harm that occurs should a single point of failure give way.
One change in technology is the growing number of servers equipped with dual-corded power supplies. These servers have integral power supplies that plug into two different power sources, and the power supply switches between the two sources internally.
These two power sources are typically power-distribution units fed from an A source and a B source. In more sophisticated systems, the A and B redundancy is carried all the way back to the utility and generator backup. In the most basic system, at least two separate distribution systems are required for a minimum of redundancy. Even if both are fed from the same panelboard, the A/B arrangement allows for power-distribution unit failure or maintenance. If the power supply transfer fails, only that server is lost.
Changes in technology are also affecting single-corded devices. An example is rack-mounted static automatic transfer switches, devices that typically occupy no more than 7 inches of rack space and, if correctly sized, can provide power for an entire rack. Because the transfer between A and B sources is located close to the equipment, a failure in the switch only affects the servers that rely on it.
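Conceptually, such a switch monitors its preferred source and moves the rack's load to the alternate source the moment the preferred one drifts out of tolerance. The sketch below illustrates that decision logic only; the voltage limits and source names are assumptions, and a real static transfer switch performs the transfer in solid-state hardware within a few milliseconds.

```python
# Illustrative rack-level A/B source selection, as a static transfer switch
# might decide it. Voltage limits are assumed values, expressed per-unit of nominal.

LOW_LIMIT, HIGH_LIMIT = 0.9, 1.1   # acceptable voltage band

def select_source(v_a, v_b, preferred="A"):
    """Return which source should feed the rack, given per-unit voltages."""
    a_ok = LOW_LIMIT <= v_a <= HIGH_LIMIT
    b_ok = LOW_LIMIT <= v_b <= HIGH_LIMIT
    if preferred == "A" and a_ok:
        return "A"
    if b_ok:
        return "B"
    return "A" if a_ok else "none"  # neither source healthy

print(select_source(1.00, 1.01))   # A healthy: stay on A
print(select_source(0.00, 1.01))   # A lost: transfer to B
```

Because each switch serves only one rack, a failed transfer affects only that rack's equipment rather than an entire floor.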
Rack-mounted switches have become popular with IT managers because they can be applied locally and installed without heavy construction. They are often part of the IT budget, which allows on-contract electricians to install them and bring A/B redundancy to the racks that need it. The challenge for the design engineer and facility executive is to coordinate the IT customer's budget with the facility budget to ensure that the rack-mounted switches are correctly specified, installed and maintained.
— Daren Shumate
Daren Shumate, PE, serves as a principal and director of the MEP engineering studio in the Washington, D.C. office of RTKL, an international architectural and engineering firm. He has extensive experience in electrical engineering and related specialties, including power distribution, lighting design, life-safety systems, security systems, telecommunications, emergency standby power, power conditioning, controls and instrumentation. His responsibilities have included the management of a wide variety of projects from conceptual design through commissioning of new systems.