In Good Working Order
Communications and maintenance are key to keeping mission-critical facilities up and running
Building owners invest a significant amount of money, up to $1,200 per square foot in some cases, in the design and construction of mission-critical facility infrastructure. These facilities can theoretically operate with an uptime standard of up to “five 9s”: 99.999 percent of the time, or less than 6 minutes of downtime per year.
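The “five 9s” figure can be checked with simple arithmetic. The short Python sketch below converts an availability percentage into minutes of downtime per year; the function name and the choice of a 365-day year are assumptions made for illustration, not part of any formal standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare a few common availability levels.
for label, pct in [("three 9s", 99.9), ("four 9s", 99.99), ("five 9s", 99.999)]:
    print(f"{label} ({pct}%): {annual_downtime_minutes(pct):.2f} minutes per year")
```

At 99.999 percent the result is roughly 5.3 minutes per year, which is consistent with the “less than 6 minutes” figure.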
That’s important because a momentary business interruption at the computer level can cost the organization hundreds of thousands of dollars in lost system productivity or good will. But it takes more than a good design to protect that investment and ensure downtime is minimized. Good operating and maintenance practices are also crucial.
Those practices ensure that, at any time of day or night and on any day of the year, emergency work is performed effectively and in the shortest possible time, while keeping the facility online if possible. To achieve this goal, replacement parts must be readily available on site or in a nearby warehouse, trained staff must be available at a moment’s notice, and everyone involved must know the process for performing the work.
That means it’s not just first costs that are higher than ordinary. Beyond first costs, additional operating expenses include the cost of quality replacement parts, storage for those parts, and a qualified facility surveillance system, meaning one that can reliably deliver an alarm message for any system at any time, under any circumstances, to selected personnel. Other costs include additional on-site personnel to handle the additional maintenance requirements; trained, on-call vendors; increased scheduled maintenance; and additional training for all technicians.
Higher costs can be justified by showing that the cost of a business interruption in terms of hard and soft money — real dollars and business good will — outweighs the additional real expense of first cost and total operating costs.
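One way to frame that justification is as a comparison between the outage cost avoided each year and the extra annual spending. The Python sketch below is a minimal illustration under stated assumptions: every figure in it (outage frequency, cost per outage, annual premium) is hypothetical, and soft costs such as good will would have to be estimated and folded into the cost per outage.

```python
def expected_annual_outage_cost(outages_per_year: float, cost_per_outage: float) -> float:
    """Expected yearly loss from business interruptions."""
    return outages_per_year * cost_per_outage


def premium_is_justified(
    baseline_outages: float,
    improved_outages: float,
    cost_per_outage: float,
    annual_premium: float,
) -> bool:
    """True when the avoided outage cost exceeds the extra annual spending.

    annual_premium is assumed to combine amortized first cost with the
    additional operating cost of running a critical facility.
    """
    avoided = expected_annual_outage_cost(
        baseline_outages, cost_per_outage
    ) - expected_annual_outage_cost(improved_outages, cost_per_outage)
    return avoided > annual_premium


# Hypothetical figures, for illustration only.
print(
    premium_is_justified(
        baseline_outages=2.0,     # assumed outages per year in an ordinary facility
        improved_outages=0.2,     # assumed outages per year in a critical facility
        cost_per_outage=300_000,  # assumed hard plus soft cost of one interruption
        annual_premium=400_000,   # assumed extra first cost and operating cost per year
    )
)
```

With these assumed numbers the avoided cost is about $540,000 per year against a $400,000 premium, so the higher spending would be justified. Different inputs can easily reverse the answer.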
Achieving Reliability
Maintenance as a work element is traditionally associated with preventive maintenance. This class of maintenance reviews the operating status of a piece of equipment on a periodic basis and performs basic tasks, such as changing filters, to keep the equipment operational. Most facility executives manage their equipment maintenance through a scheduled preventive maintenance program. Computerized maintenance management systems are commercially available to support such programs.
Predictive maintenance goes one step further. In its simplest form, the equipment or system is analyzed to validate its condition. The goal is to predict failure, so that maintenance can be scheduled and performed before failure occurs.
Concepts like reliability-centered maintenance, previously applied mostly to airplanes and equipment supporting production processes, are being adapted to mission-critical systems and equipment. These concepts rely on both the operator and the person performing maintenance to keep equipment and systems operational by examining their use and eliminating points of failure; analysis techniques of predictive maintenance are also employed. Reliability-centered maintenance also seeks to replace emergency repair with scheduled repair.
Predictive maintenance and reliability-centered maintenance are ideally applied to critical facilities because both methodologies involve significant analysis of the systems and equipment in a facility to find ways of minimizing repairs and optimizing maintenance intervals.
Usually, repair in any facility happens with little fanfare and, if done quickly, with little inconvenience. In a critical facility, repair is a big event.
In an ordinary facility, repairs to equipment often occur when no one is around, late at night or during weekend hours. For critical facilities, those off-shift windows of opportunity disappear: There is no off shift.
Lines of communication need to be established to tell management what repair needs to be done and to gain agreement with management that the level of risk to the facility from the repair action is warranted. Most critical facilities operate on a schedule set by IT, which also usually dictates conditions under which repairs can be accomplished. The repair work procedure needs to be written out, agreed to and clearly understood, including the part about what happens if something goes wrong during the repair and exposes the facility to an unacceptable level of risk. For emergency repairs, the need for an orderly process is even more acute.
Change Management
This communication process really involves change management. In this case, the repair itself represents change — change in the way a piece of equipment operates. A change-management process establishes a procedure to minimize adverse conditions at the facility during the course of the change, whatever it is. The IT group drives this process, as facility downtime for any reason affects IT and network systems that are the basis of the facility income stream.
The key to success with change management is communication. This effort starts when the person charged with managing the operation of the facility puts together a plan for briefing company management about what will happen when repair work is needed. The process of change management includes carefully documented work scopes, including individual responsibilities and parts required. It also addresses the scenario of a repair failure, where something goes wrong during the repair and the work has to be stopped.
Finally, the work itself must be coordinated to minimize the time of repair and risk of facility disruption.
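The documented elements described above can be pictured as a simple checklist record. The Python sketch below is a hypothetical structure, not a description of any particular change-management product; the class name, field names and sample values are assumptions chosen to mirror the items named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record of what a repair needs before it may proceed."""

    work_scope: str = ""
    responsibilities: dict[str, str] = field(default_factory=dict)  # person -> task
    parts_required: list[str] = field(default_factory=list)
    back_out_plan: str = ""  # what happens if the repair has to be stopped
    approved_by_management: bool = False

    def missing_items(self) -> list[str]:
        """List the documentation items that are still incomplete."""
        missing = []
        if not self.work_scope.strip():
            missing.append("work scope")
        if not self.responsibilities:
            missing.append("individual responsibilities")
        if not self.parts_required:
            missing.append("parts required")
        if not self.back_out_plan.strip():
            missing.append("back-out plan")
        if not self.approved_by_management:
            missing.append("management approval")
        return missing

    def ready_to_proceed(self) -> bool:
        return not self.missing_items()


# Illustrative request with invented values; the back-out plan and approval are absent.
request = ChangeRequest(
    work_scope="Replace failed fan in cooling unit 3",
    responsibilities={"Technician A": "isolate unit", "Technician B": "swap fan"},
    parts_required=["fan assembly"],
)
print(request.missing_items())
```

A check of this kind makes it harder for work to start before the failure scenario has been written down and agreed to.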
In a critical facility, any maintenance — preventive, predictive, or reliability-centered — or infrastructure repair is always considered a change and is subject to the change-management process. For a repetitive maintenance task, the key is that, during the initial development of the task, every effort is made to determine the risk of interruption to the facility and to minimize this risk.
Both predictive and reliability-centered maintenance require a detailed change-management process to be put in place. Reliability-centered maintenance and, to a lesser extent, predictive maintenance also aim to make repairs before they become emergencies. This improves the achievable uptime for the facility by removing the risk inherent in emergency change-management procedures.
Critical facilities are expensive to maintain. The payback is that customers are served every hour of every year. Maintenance and the accompanying category of repair are crucial to ensuring that the owner’s investment in design pays off. The extra effort to make both the maintenance and repair tasks at a critical facility as risk-free as possible should be considered a relatively inexpensive insurance policy.
Robert Atkins, P.E., is a senior associate with EYP Mission Critical Facilities, Inc., located in the San Francisco office.