Data center operations and maintenance teams must be prepared to act swiftly and surely, often without warning. Unforeseen problems can lead to injury or downtime. Good preparation and process, however, can quickly and safely mitigate the impact of emergencies and help prevent them from happening again.
A new white paper I co-authored lays out a framework for making sure operators are prepared when the stuff hits the fan. This sort of information and advice is sorely lacking today. The data center industry seems to focus almost exclusively on infrastructure design, equipment, and monitoring, always in the context of cost, reliability, and efficiency. Far less is published on facility operations, safety, and maintenance. But no matter how well designed and implemented a facility is, the stuff can still hit the fan. So I think the industry would benefit if more experts spoke out on this topic and shared best practices. Schneider Electric helped get this started by submitting our internal Data Center Facility Operations Maturity Model to The Open Compute Project and embedding it in our White Paper 197. The model gives facility teams a detailed scorecard for benchmarking their operations and maintenance program and performing a gap analysis to find where it can improve.
As someone wisely said, “good preparation is the best defense,” and so it is with facility operations. Preparation begins with developing Emergency Operating Procedures (EOPs) for all higher-risk failure and fault scenarios, such as the loss of UPS redundancy, the failure of a generator to start, or the shutdown and failed restart of a chiller plant (the paper contains a longer list). Each EOP should define the precise, step-by-step actions for safely and quickly isolating the fault and restoring service, and it must be kept near the area where the work is to be done. Clear escalation procedures that notify and bring in the right people with the right skills at the right time are another critical aspect of emergency response. Well-documented, peer-reviewed procedures are the foundation of a good emergency preparedness program. Another key element is an ongoing training program that includes drills on the EOPs. All operators (not just new hires) should periodically have to demonstrate, physically and mentally, how to carry out the EOPs. The more detailed and realistic the drills, the better. Restoring service safely and as quickly as possible depends on having people who both truly understand the EOPs and can carry them out quickly and efficiently. This is the core of the program.
The paper covers the full emergency preparedness program, which also includes an incident management program. That part of the paper discusses how to quickly detect incidents when they occur and how to classify them so that the right response and resources are brought to bear. There’s also a notification process that ensures key stakeholders are made aware at the right time, based on how the incident is classified. The final component of the program covers how to report incidents and the importance of performing a failure analysis to understand the root cause and, ultimately, to prevent such an incident from happening again. If you’re a data center manager or operator, the paper and maturity model are well worth checking out.
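For readers who track incidents in software, here is a rough sketch of what classification-driven notification could look like. Everything in it is an assumption made for illustration: the three severity classes, the rules in `classify`, and the roles in the `NOTIFY` matrix are hypothetical placeholders, not the scheme from the paper. A real program would define its own classes and contact lists.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical incident classes; a real program defines its own."""
    MINOR = 1     # no loss of redundancy, no impact on the load
    MAJOR = 2     # redundancy lost, load still supported
    CRITICAL = 3  # load dropped or people at risk


# Hypothetical notification matrix: who is told for each class.
NOTIFY = {
    Severity.MINOR: ["shift lead"],
    Severity.MAJOR: ["shift lead", "facility manager"],
    Severity.CRITICAL: ["shift lead", "facility manager", "site director"],
}


@dataclass
class Incident:
    description: str
    redundancy_lost: bool = False
    load_affected: bool = False
    safety_risk: bool = False


def classify(incident: Incident) -> Severity:
    """Assign a class, checking the most serious conditions first."""
    if incident.load_affected or incident.safety_risk:
        return Severity.CRITICAL
    if incident.redundancy_lost:
        return Severity.MAJOR
    return Severity.MINOR


def stakeholders_to_notify(incident: Incident) -> list[str]:
    """Look up the notification list from the incident's class."""
    return NOTIFY[classify(incident)]


if __name__ == "__main__":
    incident = Incident("UPS module B failed", redundancy_lost=True)
    print(classify(incident).name, stakeholders_to_notify(incident))
```

Keeping the classification rules separate from the notification matrix means either one can be revised after a drill or a failure analysis without touching the other.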