How to Prevent Data Center Downtime
Use these best practices and suggested checklists for managing the human element of critical facilities.
The coronavirus pandemic elevated data centers from resource to utility. While the economy took a severe blow, the shutdown would have had significantly greater negative impacts were it not for our ability to carry on with work, learning, shopping, and entertainment online, from home. There is widespread agreement that remote work will continue for a long time and, in many cases, is here to stay.
For that reason, data centers will continue to serve as utilities. Just as assuredly as we expect the lights to come on when we flip the switch, we expect to be able to connect to online content, classrooms, and corporate networks on demand and without fail.
While data centers have become far more reliable, thanks to a heightened priority placed on uptime and evolving resiliency strategies, outages continue to occur. Most are minor, but with so much critical data hosted and stored in data centers, outages today come at a greater cost. The implications go beyond inconvenience and frustration to include heightened potential for security breaches, reputational damage, and massive revenue loss. Major outages at large companies can be especially costly: in a few cases, losses have exceeded $100 million in value. And the cost of outages is going up.
Respondents to The Uptime Institute’s 2020 Data Center Resiliency Survey reported heightened concern about service outages compared to the prior year. Only about 5 percent of 529 respondents said they were less concerned now. The vast majority said they are as concerned or more concerned about outages than in prior years.
Between increased demand and more valuable data housed in data centers, the pressure is on to make strides in improving uptime. So, what’s the best way to go about that? Where should the industry focus its efforts to improve uptime?
Focus on People and Processes
In The Uptime Institute’s 2020 annual survey, 75 percent of respondents said their most recent downtime could have been prevented with better management or processes. The Institute asked survey participants what the most common root causes for human error-related IT outages were at data centers over the past three years and learned:
- 44 percent said the outage stemmed from not having the right processes and procedures.
- 57 percent of respondents said failure to follow procedure resulted in an outage.
From there, root causes of data center failures included service issues (27 percent), installation issues (26 percent), staffing issues (22 percent), and insufficient preventative maintenance (20 percent). Most of these are a function of poor leadership and of putting cost before reliability, which are more difficult matters to unravel and rebuild.
For those committed to improving reliability, these findings are good news: they give us a clear path for focusing efforts to improve performance.
Opportunity for Improvement
More errors occur when people perform highly repetitive tasks or work in environments that lack the variety needed to prompt mental stimulation and focus. Data centers are, by design, uniform: large halls filled with rows of racks that look much the same.
That consistency, standardization, and uniformity are something of a double-edged sword. On one hand, it’s easy to mistake one piece of equipment for another, get lost in a sea of servers, and accidentally conduct maintenance on the wrong piece of equipment. One good strategy is to build more differentiation into the design, color-coding equipment and equipment rooms to minimize confusion and lessen the opportunity for human error. On the other hand, standardized design and operation support repeatable processes that can be built into a checklist to avoid downtime.
Improving Process, Preventing Error
A strong first step toward minimizing downtime resulting from human error is to adopt validation processes or checklists. While that may sound simplistic, think of how critical checklists are to the military, the nuclear energy industry, surgery, aviation, and other industries.
In our multi-tasking, distracted world, rigor is everything. The simple process of checking a box creates consistency, ensures that steps are followed and processes are completed to eliminate room for error. This pays dividends in the data center environment.
In this technology-based industry, checklists live on handheld digital devices, ideally with two-step authentication and validation for critical steps on a given checklist. This structure all but guarantees the efficacy and reliability of a host of data center operational processes.
Suggested checklists for data center operations include:
- Maintenance – In addition to keeping tabs on all the maintenance jobs in a given workday and ensuring the right piece of equipment is serviced, the checklist helps ensure proper sequences are followed, from powering down to restarting the equipment being worked on. As equipment gets more complicated, mental checklists need to be replaced with digital ones.
- Physical Security – As data center campuses grow increasingly large and subject to external threats, applying checklists to security protocols gives operators security validation, keeps a record of visits, and tracks visitors’ comings and goings. It’s also a good way to keep tabs on physical security: Are all cameras and door keypads operational? Is the exterior fencing secure? Are vehicle gates functioning properly? A host of security protocols can be built into a handheld checklist.
- Crisis Management – Depending on location, data centers can be vulnerable to natural threats such as earthquakes, flooding, tsunamis, volcanic eruptions, and hurricanes. Threats can also be man-made: at the start of the year, many tech platforms received domestic terror threats after they took steps to eliminate certain content, and the data centers hosting them feared their stability could be compromised. Crisis preparedness protocols, contacts, information-gathering needs, and other resources are another important, systematic element of secure operations, and they are ripe to be built into checklists that establish readiness for a natural disaster or other emergency.
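As a rough illustration of how a digital maintenance checklist could enforce the ideas above, the following Python sketch requires steps to be checked off in sequence and requires a second person to confirm each critical step. It is only a sketch under stated assumptions: the class names, step descriptions, and operator names are hypothetical, and "two-step validation" is read here as confirmation by a second person, which is one possible interpretation and not a description of any particular product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    description: str
    critical: bool = False             # critical steps need a second person
    done_by: Optional[str] = None      # who checked the box
    verified_by: Optional[str] = None  # who confirmed it, if anyone


@dataclass
class Checklist:
    name: str
    steps: List[Step] = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        """Return the first step not yet checked off, or None if all are done."""
        for step in self.steps:
            if step.done_by is None:
                return step
        return None

    def complete(self, description: str, operator: str,
                 verifier: Optional[str] = None) -> None:
        """Check off a step, enforcing sequence and second-person validation."""
        step = self.next_step()
        if step is None:
            raise ValueError("checklist is already complete")
        if step.description != description:
            raise ValueError(f"out of sequence: next step is {step.description!r}")
        if step.critical and (verifier is None or verifier == operator):
            raise PermissionError("critical step needs a second, different person")
        step.verified_by = verifier
        step.done_by = operator

    def is_complete(self) -> bool:
        return self.next_step() is None


# Hypothetical maintenance job; equipment and step names are illustrative only.
job = Checklist("Unit A maintenance", [
    Step("Confirm equipment tag matches work order", critical=True),
    Step("Power down unit", critical=True),
    Step("Perform service"),
    Step("Restart unit", critical=True),
])

job.complete("Confirm equipment tag matches work order", "operator_1", "operator_2")
job.complete("Power down unit", "operator_1", "operator_2")
job.complete("Perform service", "operator_1")
job.complete("Restart unit", "operator_1", "operator_2")
print(job.is_complete())
```

Attempting a step out of order, or checking off a critical step without a second person, raises an error instead of silently proceeding. That is the point of moving from a mental checklist to a digital one.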
The best processes and tools only work if you have a team trained to use them. In addition to creating the processes to prevent human error and installing a system for verifying those steps, the facility must invest in training its staff. Because data centers are highly complex and interconnected, training programs and exercises among the different groups that support the facility are a must.
Checklists are, and will continue to be, a large part of ensuring preparedness. Recall Sully Sullenberger, who landed an Airbus A320 on the Hudson River in New York City in 2009 and saved 155 lives. He used a checklist to land that plane, even in desperate conditions.
The humble checklist has prevented many disasters in high-risk industries and should be put to work in data centers to achieve maximum uptime. Even as technologies like machine learning and artificial intelligence gain prominence in operations, staff will continue to play a large part in operating data centers and can maximize their effectiveness with a clear, well-documented list of processes, procedures, and priorities.
The data center business can borrow lessons from the military, medicine, and aviation. By applying checklist rigor to the way we operate data centers, we can improve uptime for these increasingly integral assets.
Sudhir Kalra is Compass Datacenters’ senior vice president of Global Operations. Prior to joining Compass, Kalra served as Executive Director, Global Head of Enterprise Data Centers for Morgan Stanley. Prior to Morgan Stanley, Kalra was Director, Corporate Real Estate and Services – Global Head of Engineering and Critical Systems at Deutsche Bank, where he was responsible for mission-critical support of a real estate portfolio comprising over 30 million square feet.