Statistics show that the bulk of data center disasters are caused by human error. In talking to many data center managers, I've found two root causes for these errors: either a lack of defined change management processes and procedural controls, or the bypassing of existing procedures to get a "simple" change done more quickly.
Let's back up a bit first. I'm not just talking about large-scale disasters like hurricanes or ice storms. I'm talking about any event in the data center -- from mistakes to a lack of oversight -- that disrupts normal business operations, costing the company revenue. Natural disasters are relatively rare compared with mistakes made in the data center by IT staff or someone else. I'll "rat hole" quickly on the subject of large-scale disaster recovery (DR) preparedness. I've noticed a trend in my business that has been validated by many others I talk to. The number of inquiries about disaster recovery planning is seasonal, ramping up in August and slowing in November. This tends to correspond to hurricane season in the U.S. and to the period when the bulk of corporations begin their annual budget-planning process for the next year.
So now, in early 2009, during the lull in disaster recovery planning interest and as our economy continues to slip deeper into recession, squeezing IT budgets even further, let's talk about how to avoid the most common causes of downtime in the data center.
IT process maturity models: CMM and ITIL
The Capability Maturity Model (CMM) defines five levels of software process maturity, with Level 5 being the highest. Attaining each level requires a great deal of work, but the benefits gained are well worth the investment. The IT Infrastructure Library (ITIL) provides a framework that IT organizations can customize to their needs to achieve a higher organizational maturity level.
But let's talk about the realities of raising your organization's maturity level. First, it's not an overnight process. Most organizations take around a year to move one level higher, often more. Staff training is required, and the usual issues abound as many staffers fight against change. It isn't until they personally experience these changes (which always appear at first as extra work) that they start to understand the value of the process and become energetic supporters. Still, there are always a few staffers who just don't want to adapt to the new processes. It's unfortunate, but often the best course of action in these rare cases is to move them to other positions or let them go from the organization. About a year ago, I spoke to a company that was working to move from CMM Level 2 to Level 3, and a vice president in the company refused to implement the process changes needed to make the move, saying he didn't see the value in the "extra work." After months of trying, the only recourse was to lay off that vice president and hire another in his place.
Improving organizational maturity by implementing processes and governance has been shown to reduce errors in IT change management, resulting in fewer data center disasters. But it's not a sure-fire way to eliminate human-induced errors completely. Often the little things still find their way into becoming disasters.
The 'small change' data center disaster
Research conducted by the Burton Group has consistently shown that it's the small things that get IT organizations into trouble. The scenario tends to look something like this:
- The IT organization is always searching for more efficient ways to do things -- the most obvious way is to cut fat from any process or procedure in order to streamline it.
- A process step for a small configuration change seems like something that can be skipped. It's a seemingly small change, and the step seems unimportant when compared with accomplishing the task faster.
- The step is skipped and the task is completed a bit faster.
- The first time, it seems to work without incident.
- The step is skipped again, and possibly again.
- The situation the step was meant to cover occurs and an IT system malfunctions, resulting in a data center disaster.
The bottom line of process maturity improvements is that processes and procedures must be adhered to, even when they do not seem important in every case. This discipline helps staff resist the temptation to shortcut a process -- something that usually gets an organization into trouble.
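One way to picture that discipline is a change request that simply cannot be applied until every required step has been recorded as done. The following is only a minimal sketch, not a real change management tool: the `ChangeRequest` class, the `SkippedStepError` exception and the step names are all hypothetical, invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


class SkippedStepError(Exception):
    """Raised when a change is applied before every required step is done."""


@dataclass
class ChangeRequest:
    # Hypothetical model of a change: a description plus the procedure steps
    # the organization requires, however small the change appears to be.
    description: str
    required_steps: Tuple[str, ...]
    completed: Set[str] = field(default_factory=set)

    def complete(self, step: str) -> None:
        """Record a step as done; reject steps that are not in the procedure."""
        if step not in self.required_steps:
            raise ValueError(f"unknown step: {step}")
        self.completed.add(step)

    def missing_steps(self) -> List[str]:
        """Return the required steps not yet completed, in procedure order."""
        return [s for s in self.required_steps if s not in self.completed]

    def apply(self) -> str:
        """Apply the change only if no required step was skipped."""
        missing = self.missing_steps()
        if missing:
            raise SkippedStepError(
                f"cannot apply '{self.description}'; skipped: {', '.join(missing)}"
            )
        return f"applied: {self.description}"


if __name__ == "__main__":
    # Illustrative step names only; a real procedure would define its own.
    change = ChangeRequest(
        "update firewall rule",
        ("peer review", "backup config", "rollback plan"),
    )
    change.complete("peer review")
    try:
        change.apply()
    except SkippedStepError as err:
        print(err)
```

The point of the sketch is that the check does not depend on anyone's judgment about whether a step "matters" for this particular change; the shortcut is simply not available.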
It's time to improve IT process maturity
Times of economic downturn offer an opportunity for IT organizations to improve their organizational maturity. During times of prosperity, the IT organization's focus is on building out IT infrastructure and services as quickly as possible to support business growth. All CIOs understand that IT processes are responsible for enabling business growth and should never be viewed as a hindrance to the business. As one of my colleagues says, "In prosperous times, the IT organization works feverishly to lay tracks in front of the business locomotive as fast as possible. But during economic downturn, the IT organization has the opportunity to reflect on their architecture, organization and processes to make improvements for efficiency."
Now is the time for IT organizations to turn their focus to improving organizational maturity and efficiency, with one of the key benefits being a reduction in human-caused data center disasters.