Advisor

Building a Robust IT Recovery Organization

Posted October 22, 2014 | Leadership

The attacks on the Twin Towers of the World Trade Center in New York City on 11 September 2001 exposed the need for developing business continuity planning and disaster recovery (BCP-DR) strategies. After those attacks, IT infrastructure planners and data center architects across the world started incorporating alternate sites and high availability of applications and services in their designs.

Medium- to large-scale organizations invest heavily in keeping fail-over sites ready either in "hot" mode or "cold" mode. A hot site is one that has all the servers and applications running and can quickly take over the load in case of a disaster. A cold site is one in which the infrastructure is in place but it needs to be turned on when needed. The choice between hot and cold sites depends on how much the business wants to invest in its BCP framework.

Many IT service providers (captive and noncaptive) have articulated BCP procedures as part of their contracts. These procedures are subject to audits and periodic checks to make sure that the fail-over systems are ready to take over when needed. However, organizations can still improve both their approach to the unexpected and their preparedness for it. They need to be proactive, continuously developing and implementing BCP-DR procedures. The following are a few suggestions that can help (see Figure 1):

Figure 1: Architecting an IT recovery organization and defining its charter.

  • Create a chief recovery officer (CRO) organization. Having a dedicated team of recovery specialists responsible for BCP architecture and alternate assets (fail-over sites, servers, networks, and so on) is essential for ensuring unambiguous ownership of recovery procedures. As the organization grows larger in size, its vulnerability to unexpected failures and service outages increases; hence the need for a team focused on business continuity and recovery.
  • Align the reporting hierarchy. Although the CRO role can be combined with the role of chief risk officer, in large organizations with an IT spend of more than US $100 million, a separate CRO is recommended. Preparing to recover from disaster and developing recovery strategies and procedures is a full-time responsibility that can no longer be handled as an add-on to another role.
  • Define clear objectives for the CRO. The recovery team has to work on a clear set of objectives aligned to the strategic goals of the organization. Example objectives include:
    • Reduce switching time between the primary site and secondary site
    • Reduce the recovery time (i.e., time to recover the services when outages occur)
    • Review backup data on a regular basis
  • Maintain a governance schedule. The recovery team should maintain a governance schedule to review the assets associated with business continuity. This will ensure that fail-over circuits and alternate assets are regularly checked and kept active as needed. Without periodic checks, the alternate site may fall out of sync with the primary site. The mantra for maintaining secondary sites is "expect the unexpected!"
  • Publish recovery processes. The recovery team has to work interdependently with the IT organization (operations, service providers, and other stakeholders). It should constantly communicate with those teams and create awareness about the recovery processes. This will help in preparing them to deal with disaster when it strikes and collaborate with the recovery team.
  • Inform customers. Let customers know about the recovery framework and inform them about the steps the team is taking to roll out business continuity services. This is particularly essential for IT service delivery organizations that serve external customers. It is a good practice to share reports of audits and mock drills.
  • Prepare people. Everybody in the organization is a stakeholder when it comes to responding to an emergency. The recovery team must educate and train the rest of the organization on how to respond to an emergency such as an attack on the data center or an outage of IT services. Encourage people to participate in DR drills and develop awareness of the procedures to be followed.
  • Offer recovery as a service. With a mature recovery framework in place, the company can explore the idea of offering IT recovery as a service to external stakeholders. It can be a good revenue-generating opportunity and may help fund the cost of running recovery operations.
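The CRO objectives and the governance schedule described above lend themselves to simple tracking. The following Python sketch shows one possible way to compute mean switching time and mean recovery time from outage records and to flag alternate assets whose review is overdue. The `Outage` fields, the asset names, the 90-day review interval, and all sample values are illustrative assumptions, not figures from any real organization.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from statistics import mean


@dataclass
class Outage:
    """One service outage. Field names are hypothetical."""
    started: datetime
    failover_done: datetime  # secondary site has taken over the load
    restored: datetime       # service fully recovered


def mean_minutes(durations):
    """Average a list of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def switching_time(outages):
    """Mean time from outage start to secondary-site takeover, in minutes."""
    return mean_minutes([o.failover_done - o.started for o in outages])


def recovery_time(outages):
    """Mean time from outage start to restored service, in minutes."""
    return mean_minutes([o.restored - o.started for o in outages])


def overdue_reviews(last_reviewed, today, max_age_days=90):
    """Return, sorted, the assets last reviewed more than max_age_days ago."""
    limit = timedelta(days=max_age_days)
    return sorted(a for a, d in last_reviewed.items() if today - d > limit)


# Made-up sample records, for illustration only.
outages = [
    Outage(datetime(2014, 3, 1, 9, 0), datetime(2014, 3, 1, 9, 20),
           datetime(2014, 3, 1, 11, 0)),
    Outage(datetime(2014, 7, 15, 14, 0), datetime(2014, 7, 15, 14, 10),
           datetime(2014, 7, 15, 15, 0)),
]
last_reviewed = {
    "failover-circuit": date(2014, 1, 10),
    "backup-data": date(2014, 9, 30),
}

print(switching_time(outages))  # 15.0
print(recovery_time(outages))   # 90.0
print(overdue_reviews(last_reviewed, today=date(2014, 10, 22)))
# ['failover-circuit']
```

Reporting these numbers at each governance review would give the recovery team a trend to measure its objectives against; the appropriate review interval depends on how quickly the primary site changes.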

In summary, organizations need to be proactive and apply thought leadership to safeguard their IT enterprise by being prudent in creating robust recovery mechanisms and preparing for the unexpected.

[Note: The views expressed by the author are personal and do not represent his organization, Wipro Technologies.]

I welcome your comments about this Advisor and encourage you to send your insights to me at comments@cutter.com.

-- Aluru Chandra

About The Author
Aluru Chandra
Aluru Chandra has 25-plus years of experience in the IT industry. He specializes in designing learning interventions for leadership development and works with project managers and delivery managers involved in IT services. Chandra writes on topics such as business insights, people development, and customer leadership. He tweets as @aluru_chandra.