Being a conductor is a hybrid profession, because fundamentally it is like being a coach, a trainer, an editor, and a director. – Michael Tilson Thomas
It was a Friday evening, and people working in the bank were preparing to wrap up their work and start enjoying the much-awaited weekend. Suddenly, a couple of employees noticed that their computer application was not working. When they called the help desk, they were told that there was a major breakdown in the IT systems and that it was under investigation.
A major incident was called out. The incident lasted throughout the weekend, and the systems were finally restored late Sunday evening. Many technical teams were involved in troubleshooting the issue, along with a couple of the vendors that supplied the equipment used in the infrastructure.
Throughout the incident, one group of people worked nonstop, chairing the diagnostic conversations all along the way until a solution was implemented and the issue resolved: the major incident managers (MIMs).
What’s an MIM to Do?
The IT Infrastructure Library (ITIL) defines an incident as a disruption in the normal functioning of a service or a reduction in the quality of a service that may lead to loss of productivity for customers. A major incident is one where there is a significant impact on the business, calling for an urgent response that is different from the routine procedure followed for incidents.
Outside the context of IT, a major incident may be called out by a competent authority when a disaster strikes a city or a country or when a security situation unfolds affecting several people. There are standard procedures defined to manage these situations.
In the context of IT, a major incident is called out when a site or a region (part of an IT estate) is knocked offline or a large group of users is unable to access the services of business applications. One of the first steps in response to a major incident is to open a conference bridge and pull in technical experts, business application owners, and other stakeholders to participate in the discussion. The bridge is chaired by an MIM, and technical experts share their findings and the steps being taken to diagnose and resolve the issue.
There is ongoing debate about whether an MIM should be technically skilled, business savvy, or process oriented. As illustrated in Figure 1, the MIM plays a crucial role in dealing with the incident: facilitating diagnostic procedures, assisting technical teams (within and outside the organization), and getting the enterprise back on track in the shortest possible time. The MIM is at the center of the situation and is responsible for orchestrating conversations among multiple stakeholders, including (but not limited to) infrastructure managers, application owners, business sponsors, internal support groups, and external vendor teams such as OEM suppliers and service providers.
Competencies of an Effective MIM
During a major incident situation, everyone looks to the MIM for leadership and direction. But we are not talking about an extraordinary person with superhuman qualities. Leading major incident calls and associated situations requires many competencies that fall into three main categories: knowledge, skills, and attitude.
Knowledge
MIMs have to have knowledge of the estate and its components, including the following areas:
- Bills of material for hardware and software assets
- Licenses for software components
- Networking diagrams
- Contact numbers of technical experts, vendors, and support partners
- Service-level agreements with every external service provider
- Operating-level agreements with internal service providers
- Basic knowledge of business applications and the servers hosting them
- Data backup policies and the locations where backups are stored
- Basic understanding of technologies deployed in the estate
- Knowledge of cultures (of the countries and geographies from where the operations are running)
Skills
There are many skills that are essential for an MIM, including but not limited to the following:
- Flawless communication: the ability to articulate status updates in simple words
- Call control: running a technical bridge and ensuring people are focusing on how to fix the issue and not on whom to blame for the incident
- Negotiation skills: negotiating with internal and external stakeholders for quicker actions and shorter turnaround times
- Probing skills: asking questions about troubleshooting steps being taken, when to expect results, and so on
- Ability to challenge experts and clarify their hypotheses
- Ability to prevent buck-passing between technical teams
- Collaboration skills: bringing different groups to work under one common goal of dousing the fire
- Orchestration skills: ensuring loud voices don’t drown out quieter but insightful voices
Attitude
Leading major incident management calls requires a leadership attitude. Here are some of the areas of attitude that make MIMs successful and effective:
- End-to-end ownership: constantly monitoring progress and pushing for quicker resolution
- Honesty and integrity
- Courage to convey bad news to senior leadership so that they know ground reality as it is
- Standing up for what is good for the enterprise even if it means angering some so-called experts and heavyweights in the organization
- Transparency in reporting reality to business: not buckling under pressure and not reporting false status
- Learning orientation: an attitude of learning from every major incident and improving processes and systems to make them less vulnerable
- Gratitude: acknowledging the contributions made by different people in resolving the incident, building loyalty and improving commitment
Major incidents don’t happen often. But when they do happen, they test the agility and maturity of the organization. People participating in major incident calls have to practice call etiquette for smooth conversations and quicker resolution. It is not uncommon to see experts trying to show off their technical competence on major incident calls.
The Etiquette of a Major Incident Call
- Basic telephone etiquette, including:
  - Not placing the call on hold
  - Muting the phone line while not speaking (to prevent background noise)
  - Speaking in a language commonly understood by everyone on the call
  - No side-chats
  - Polite and professional tone
  - Identifying yourself every time you join and leave the bridge
- Punctuality: joining the bridge call as soon as required
- Avoid blaming anyone for the incident
- When providing status updates, keep in mind that not everyone who joins a major incident call is a technical expert
- Do not take remarks personally; focus on resolution instead
- When required, escalate the matter to higher-level leaders and seek their support
As Kurt Masur, the famous German conductor, said, “You have to change your mind with every orchestra because every orchestra has a different character.” So it is with major incident calls: each one is different.
Conclusion
The practice of major incident management has to be developed by investing resources, training people, and creating career paths for practitioners. When a fire in a building is doused, everyone is busy thanking the firefighters. Who is thanking the fire-engine driver for bringing them to the spot?
Organizations tend to see technical experts as the key contributors in resolving major incidents. In the process, incident managers and their work are overlooked by senior management. Most CIOs, however, know that MIMs are crucial for keeping the IT estate free from disruption, and they recognize their contributions accordingly. They are also sure to empower MIMs to make critical decisions and drive teams toward closure. They know that because of the relentless work of MIMs, they are able to sleep peacefully at night.
[The views expressed by the author are his personal views and do not represent his organization.]