How the Business and IT Can Be Resilient during Times of Crisis
Coronavirus, active shooter, ransomware attack. Sadly, these words could easily be taken from the “breaking news” ticker of any major news outlet. And yet there are more: tornado, hurricane, flood, and power outage, to name a few. The impact of these disasters on individuals is, needless to say, traumatic. For organizations, they represent major disruption to operations, lost revenue, reputational damage, exposure to litigation, and compliance violations. In fact, Forbes, citing FEMA, notes that “more than 40% of businesses never reopen after a disaster, 29% are still in operation two years later, and for those that lose information technology for nine days or more, bankruptcy happens within a year.” The global outbreak of the coronavirus has caught many organizations unprepared and has compelled them to review (or take a first look at) business continuity and disaster recovery plans. Even digital businesses are not immune to disruption and are searching for ways to be more resilient during the current crisis.
Although business continuity planning is ultimately the responsibility of senior executives, many of them are not doing it. How can an IT leader raise awareness and convince senior leadership about the importance of business continuity planning even though only a portion of it is in the realm of IT?
In this article, learn the basics of business continuity and disaster recovery, how to prioritize resources and efforts, where to start, and what to do right now if you don’t have a plan.
The 51%
Given the prevalence and variety of business-impacting disasters, it would be reasonable to assume that any serious organization has business continuity and disaster recovery plans in place and continually updated, right? WRONG! In fact, according to Backup and Recovery Solutions Review, 51% of organizations have no business continuity plan at all. My own experience bears this out, and it is shocking just which organizations have no plans or inadequate plans – hospitals, internationally-renowned children’s educational programs, universities, and any number of small businesses. In one case, a hospital put in place a “downtime” plan to use paper records and cancel certain elective surgeries in case the electronic patient record system were to go down; but they had no plan in place to consider certain IT employees, who run many of the systems that keep the hospital running, as critical staff.
In another case, a global children’s educational nonprofit, which delivers the vast majority of its educational services in a physical classroom environment, has little capacity to quickly transition classes to an online format. As a result, and due to coronavirus-compelled social distancing, it stands to lose more than $30 million in funding and tuition revenue if it cannot quickly propose alternative classroom-delivery options.
Also because of coronavirus, virtually all major U.S. universities (I teach graduate students at one of them) have canceled in-person classes, extended spring break, and ordered professors to teach the rest of the semester online. All things considered, this is a good prescription. However, not all students have broadband internet access and some do not have laptops. Not all students, and even fewer professors, have significant training in online learning platforms, and technology can be notoriously uncooperative. Moreover, teaching an online class is very different from teaching in person (arguably, it is much harder), and adapting curriculum, teaching styles, and communication will not be easy to do overnight for many professors. Much of this could have been avoided or mitigated with early preparation.
To be sure, these large organizations will take a hit from the coronavirus outbreak, but they are likely to survive long-term. The prospects for small organizations are much graver. According to a 2013 University of Arizona study, conducted in an office with minimal social interaction where people stayed in their offices for most of the day, a non-harmful virus introduced to one employee spread to more than 50% of surfaces within just four hours. In other words, it is entirely possible that just one sick person can take out an entire small business by doing nothing more than showing up for work.
A Risk-Aware Mindset – No More Black Swans
At its core, business continuity and disaster recovery are a subset of the Risk Management Practice and depend on leadership adopting a risk-aware mindset. Being risk-aware is not the same as being risk averse. It does not suggest that leadership should over-react to every possibility of a negative outcome and fear taking any risk. Instead, a risk-aware mindset encourages leaders to consider, in advance of a risk materializing, how the organization will address it. As one client suggested to me, in an organization with a healthy risk mindset, “There are no more Black Swans.”
What is the Difference Between Business Continuity and Disaster Recovery?
People often confuse the terms “business continuity” and “disaster recovery.” To be sure, they are related concepts, and planning for both is often performed in tandem by many of the same people. But they are not the same. Techopedia has a simple and reasonable definition of business continuity planning: “A business continuity plan (BCP) is a plan to help ensure that business processes can continue during a time of emergency or disaster.” The assumption with BCP is that at least some key aspects of the business still need to be operational, even if in a diminished capacity, during a disaster. To use a simple example, using a generator to keep the lights on in the building during a power outage is part of business continuity planning.
Disaster Recovery (DR), by contrast, focuses on restoring interrupted and degraded services and business processes following a disaster event. In this case, priority is normally given to bringing back online the most critical business processes and services first. To continue the example from the previous paragraph, DR is restoring normal electricity after a power outage.
Top 10 Natural Disasters in Terms of Financial Loss
- 2011 Earthquake and Tsunami in Japan, $360 Billion, 15,894 fatalities.
- 1995 Great Hanshin Earthquake in Japan, $197 Billion, >5,000 fatalities.
- 2008 Sichuan Earthquake in China, $148 Billion, 87,587 fatalities.
- 2005 Hurricane Katrina in U.S., $125 Billion, 1,800 fatalities.
- 2017 Hurricane Harvey in U.S., $125 Billion, 107 fatalities.
Although both BCP and DR are critical to organizations, generally, organizations spend more effort and dedicate more resources to business continuity planning since it deals with what must remain operational at all times. The reason why both are often planned in parallel is that, quite often, business continuity plans establish a different set of operational procedures to be used during disasters, and these procedures and structures need to coordinate with disaster recovery plans and at some point dissolve as DR begins to restore normal operations.
Prioritizing What Needs To Be Protected
BCP and DR compel organizations and their leaders to prioritize – to determine what is most important to protect. While priorities will vary depending on the organization and context, a good motto to keep in mind is: Life First, then Assets.
It may sound obvious to many (and remarkably not so obvious to others), but protecting the lives of employees, customers, and in some cases the community-at-large must come first in any business continuity plan. Aside from the fact that it is simply “the right thing” to do, protecting life is good for business. After all, without healthy and safe employees, it is pretty difficult to continue operations. Even so, it is amazing to witness how many organizations pressured their employees to come in to work and how few have been quick to implement remote working policies during the coronavirus outbreak. To be sure, some of this behavior can be attributed to poor planning – if no remote work policy exists, it takes at least a little time to create one. It also takes time to ensure that employees have the proper technology to work remotely and that proper communication channels and controls are in place.
Once life is protected, leaders must determine how to prioritize non-human assets with the goal of keeping the business running. Assets include critical services, key business processes, systems, and service-delivery technologies. Many organizations, especially digital companies or those heavily supported by IT, make the mistake of first identifying critical technologies and systems that need to be protected. Although it is important to catalog these key technologies, typically it is not the technology itself that needs protection; it is the outcome it delivers. Instead of starting by protecting technology, it is better to protect the services. Services represent valuable assets.
For example, during a workshop I facilitated, one major utility company client identified its automated call distribution (ACD) technology as a critical technology asset that needed to be operational throughout any disaster. The utility company used the ACD to route calls from external customers who needed to report safety issues, ask questions, sign up for new service, disconnect service, and pay their bills. Without any doubt, as a technology that supports communication with external customers, the ACD is an important component. But when asked the question, “Could you continue to provide service as a utility company without the ACD?” the answer was “Of course. It would just be less efficient for our customers to communicate with us.”
So I pressed further, “If the ACD went down, how would you address incoming customer calls?”
The answer: “Customers would call in to our call center, and our agents would have to route them to the correct place.”
“Aha! And of the incoming customer calls, which types of calls take precedence, and in what order?”
“First, calls to report gas leaks, downed power lines, and safety issues. Second, customers who want to pay their bills. Third, customers who want to sign up for new service or transfer service to a new location.”
Ultimately, the ACD may be a critical asset for this organization. But what is more important is protecting the Customer Communication Service, and the ACD is just one component that helps to do that. In addition to the ACD, there are other ways to protect the service.
Most organizations deliver somewhere between fifteen and fifty services, many of which are external customer facing and some directly for the benefit of the internal business customer. Organizations should focus on identifying a handful of critical customer-facing services and “can’t live without” internal business customer services. For example, a hospital may prioritize services, at a high-level, in this manner:
- Any service related to patient safety and clinical outcomes
- Services that support clinician safety
- Services that support revenue, billing, and finance
- Services that support employee payroll
For a university, prioritized services may look more like this:
- Services that support continued delivery of classroom education
- Services that support remote work for key university staff and support functions
- Services that support finance, revenue, and tuition processes
- Services that support employee payroll
- Services that support class registration
After identifying critical services, organizations should map key processes, departments and teams, and technology back to critical services. These, in turn, will undergo their own prioritization to determine which of these must continue when push comes to shove.
What are the Basic Elements of a Business Continuity Plan?
Business leaders often ask, “How long should a business continuity plan be?” and “What should be covered in it?” The answer to the first question is that the plan should be as long as it needs to be . . . some organizations have relatively short “master plans” that cover basics and defer more specific plans to organizational units. Some organizations have long and comprehensive plans that are centrally architected and updated (with input from lower levels of the organization). In terms of the basic elements that need to be covered, the BCP should include at least the following sections:
- Scope
- Definition of Disaster/Crisis
- Disaster/Crisis Priorities
- Affected Stakeholders
- Plan Owner
- Version
- Review Schedule
- Process to Report a Crisis
What is the Role of IT in Business Continuity and Disaster Recovery Planning?
What the ITIL framework calls IT Service Continuity Management (ITSCM) is the responsibility of IT. The goal of ITSCM is to ensure that critical IT services continue to be available during disasters and can be restored after disasters. The ITSCM plan is a subset of Business Continuity Management and should be part of the organization’s overall Business Continuity Plan.
Far from being IT-centric, ITSCM starts by having a conversation with the larger business and senior leadership to determine business priorities. It is crucial for IT to understand what services the business considers critical. The above-referenced examples are a good starting point. The activity of understanding business priorities and how IT supports them is called Business Impact Analysis.
Business Impact Analysis – As a subset of Risk Management and Business Continuity and Disaster Recovery, the purpose of a BIA is to understand the impact to the business that the loss of an IT service would have. This can be done qualitatively and quantitatively. It identifies the most important services to the organization and helps to define the overall strategy for risk reduction and disaster recovery. At a more granular level this analysis enables the mapping of critical service applications and technology components to critical business processes.
Vital Business Function – A critical element of a business process. To use a narrow example, an ATM dispenses cash and prints receipts. While printing receipts is nice, dispensing cash is critical.
Business Impact Analysis includes defining vital business functions, maximum allowable downtime, recovery time objective, and recovery point objective (discussed below), and in some cases assigning a dollar value to IT assets with regards to business continuity and disaster recovery.
IT then proceeds to identify critical services, systems, and configuration items (CIs) that support business services. It is often the case that multiple IT services and systems support a business service. Consider one of the examples from above for a university. Classroom Education is one of the services the business considers critical to continue delivering during a disaster. IT determines that the ability to deliver lessons is supported by three IT services: Educational Technology, Telecommunications, and Collaboration Services. Furthermore, the Educational Technology service provides in-class AV set-up and support, training, and an Online Learning Platform, which is hosted by a third-party vendor. The Telecommunications service includes landline telephone service and dial-tone, mobile telephones, and teleconferencing services. The Collaboration service includes online file-share, instant messaging, and a video conference service.
Working with the business, IT determines that if students cannot attend class in person, classes can be delivered remotely. This requires the online learning platform to be available during a disaster. It also means that while landline telephone service is not as critical, teleconferencing services are critical for students to communicate during and in between lessons. Finally, although online file share and instant messaging are not as important during a disaster (the online learning platform provides some file share and students can text message instead of IM), videoconferencing is considered important.
IT then reviews the third-party vendor contracts for the online learning platform and for the video conferencing service to understand current capacity, what level of support is provided, and how resilient the technology is. Additionally, IT realizes that the teleconferencing system is hosted on an on-premise server and considers purchasing third-party conferencing services to be used during disaster scenarios.
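The university example can be sketched as a small service map. This is only an illustration in Python, not a prescribed tool: the `Component` class and the helper function names are hypothetical, videoconferencing is treated as critical on the strength of “considered important,” and hosting is recorded as `None` wherever the example does not state it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    it_service: str          # IT service the component belongs to
    name: str
    critical: bool           # must stay available during a disaster
    hosting: Optional[str]   # "vendor", "on-premise", or None when not stated


# Components supporting the business service "Classroom Education",
# limited to those whose disaster priority the example discusses.
CLASSROOM_EDUCATION: List[Component] = [
    Component("Educational Technology", "Online learning platform", True, "vendor"),
    Component("Telecommunications", "Landline telephone", False, None),
    Component("Telecommunications", "Teleconferencing", True, "on-premise"),
    Component("Collaboration", "Online file share", False, None),
    Component("Collaboration", "Instant messaging", False, None),
    Component("Collaboration", "Video conferencing", True, "vendor"),
]


def contracts_to_review(components: List[Component]) -> List[str]:
    """Critical, vendor-hosted components: check capacity, support, resilience."""
    return [c.name for c in components if c.critical and c.hosting == "vendor"]


def needs_alternative(components: List[Component]) -> List[str]:
    """Critical components on an on-premise server: consider a third-party fallback."""
    return [c.name for c in components if c.critical and c.hosting == "on-premise"]


print(contracts_to_review(CLASSROOM_EDUCATION))
print(needs_alternative(CLASSROOM_EDUCATION))
```

Even a simple map like this makes the follow-up work visible: two vendor contracts to review and one on-premise system that needs a disaster-time alternative.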
IT Disaster Recovery, Maximum Allowable Downtime, Recovery Time Objective, and Recovery Point Objective
After IT and the business determine which services need to keep running (and at what level) during a disaster, a similar process is followed for other services that are important but not as critical. Here the focus is on what the maximum allowable downtime is for the service and determining how quickly the service should be restored (recovery time objective). Additionally, the recovery point objective – the amount of data loss we can tolerate – should be determined.
Maximum Allowable Downtime – MAD is the absolute maximum time that the system [or service] can be unavailable without direct or indirect ramifications to the organization.
Recovery Time Objective – RTO is the targeted duration of time and a service level within which a business process [or IT service] must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
Recovery Point Objective – A recovery point objective (RPO) is the maximum acceptable amount of data loss measured in time. It is the age of the files or data in backup storage required to resume normal operations if a computer system or network failure occurs.
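To see how the three measures relate, here is a minimal sketch in Python. The `RecoveryTargets` class, the `check_targets` function, and all of the numbers are hypothetical. It also makes a simplifying assumption: the worst-case data loss equals the interval between backups, as if a failure struck just before the next backup was due.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryTargets:
    service: str
    mad: timedelta   # maximum allowable downtime
    rto: timedelta   # recovery time objective
    rpo: timedelta   # recovery point objective


def check_targets(
    targets: RecoveryTargets,
    backup_interval: timedelta,
    estimated_restore_time: timedelta,
) -> List[str]:
    """Return a list of problems; an empty list means the targets are consistent."""
    problems = []
    # The restoration target should never be looser than the absolute limit.
    if targets.rto > targets.mad:
        problems.append("RTO exceeds MAD")
    # The restore procedure has to fit inside the target it is meant to meet.
    if estimated_restore_time > targets.rto:
        problems.append("estimated restore time exceeds RTO")
    # Assumption: worst-case data loss is one full backup interval.
    if backup_interval > targets.rpo:
        problems.append("backup interval exceeds RPO")
    return problems


registration = RecoveryTargets(
    service="Class Registration",
    mad=timedelta(hours=48),
    rto=timedelta(hours=24),
    rpo=timedelta(hours=4),
)

print(check_targets(
    registration,
    backup_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(hours=8),
))
```

In this made-up case the service can be restored in time, but nightly backups cannot satisfy a four-hour RPO, which is exactly the kind of gap these conversations are meant to surface.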
Other duties of IT when planning ITSCM include:
- Identifying when a major incident occurs (i.e. determining trigger conditions),
- Defining major incident teams and escalation paths, and
- Coordinating with vendors and understanding contractual limitations.
Starting the Conversation with the Business
Many IT leaders find themselves in the awkward predicament of understanding the importance of business continuity and disaster recovery (BCDR) yet finding little support amongst business executives. Whether it is because they equate BCDR activities with little more than having good back-up technology in place or because a vendor has sold them on so-called “cloud first” solutions, executives rarely initiate BCDR planning.
IT leaders intent on starting this conversation need to learn to speak the language of business. The following “conversation starter” questions might just do the trick:
- What do you consider the most important services the organization offers to customers?
- If one of these services went down, what would the impact be on the organization?
- When a particular business service goes down, how much money does the organization lose?
- If 20% of the workforce were gone tomorrow, how would the organization carry on?
- If the office building were to be destroyed, how would the organization continue to operate?
- If the organization fell victim to a ransomware attack or customer information were compromised and the news covered a story about this, what would be the impact on the organization?
Using one or more of these open-ended questions to provoke thought, the IT leader should suggest furthering the conversation with the executive and ultimately including other executives and business leaders to determine the scope of BCDR for the organization.
Another good practical suggestion is to use the Chief Information Security Officer (CISO) or Information Security Manager (ISM) to liaise between IT and the business. CISOs and ISMs tend to “speak” business well. And since information security is part of what an executive has overall accountability for, executives tend to be more receptive to speaking with somebody in these roles.
Areas that Are Often Overlooked
Given the wide array of possible disasters and the broad scope of BCDR, many areas are often overlooked. Here are just a few:
Employee morale – It is important to consider the morale of employees who may be impacted by a disaster, psychologically, physically, and financially. Additionally, consider how to keep employees productive and engaged during extended periods of remote work.
Leadership Succession – Succession is a broad topic and can include plans for grooming leaders over a period of years, transferring knowledge, and planning for key employee retirements and departures. In this context, the question is more prosaic. If key executives are incapacitated during a disaster, who is “next in line” to be in charge, how will they know it, and how will leadership transfers be communicated?
Technological capabilities – Does the organization have the technological capabilities to support telework, including conferencing, intranet access, and access to on-premise hardware and applications? Consider a cloud-first strategy. Do employees have training on remote tools and technologies? Does the organization require “hot” or “warm” sites? How is data backed up and restored?
“Old School” Businesses that Benefit from Natural Disasters
- Hotels
- Disaster Clean-Up
- Forensic Weather Experts
- Self-Storage Business
- Tree-removal companies
- Auto-repair shops
- Hardware Stores
- Grocery Stores
- Liquor Stores
Cash Flow, Access to Working Capital, and the 180-Day “Rule” – Personal financial planners have long suggested keeping three to six months of expenses in a “rainy day” fund to cover unexpected job loss. Likewise, in the nonprofit sector, I suggest what I call the 180-day “rule.” An organization should have access to capital such that if all sources of funding stopped immediately, it could still operate with current expenses for 180 days – that means paying employees, keeping the lights on, paying vendors, etc. For small businesses, a similar guideline should be followed. With any luck, not all revenue permanently disappears, but in the case where an entire geographical region is impacted or when there is a global pandemic, engagements may be delayed, which means revenue will be booked later than anticipated. Healthy cash flow also allows organizations to continue to pay employees who may need to take time off to address personal health issues or those of dependents and relatives.
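The arithmetic behind the 180-day “rule” is simple enough to sketch. The function names and figures below are hypothetical, and the sketch assumes expenses stay flat at their current monthly level with no revenue coming in at all.

```python
def daily_expenses(monthly_expenses: float) -> float:
    """Convert current monthly expenses to an average daily figure."""
    if monthly_expenses <= 0:
        raise ValueError("monthly_expenses must be positive")
    return monthly_expenses * 12 / 365


def days_of_runway(accessible_capital: float, monthly_expenses: float) -> float:
    """Days the organization could operate if all funding stopped today."""
    return accessible_capital / daily_expenses(monthly_expenses)


def shortfall(accessible_capital: float, monthly_expenses: float,
              target_days: int = 180) -> float:
    """Additional capital needed to reach the target; zero if already met."""
    needed = target_days * daily_expenses(monthly_expenses)
    return max(0.0, needed - accessible_capital)


# Hypothetical organization: $450,000 accessible, $100,000 in monthly expenses.
runway = days_of_runway(450_000, 100_000)
gap = shortfall(450_000, 100_000)
print(f"Runway: {runway:.0f} days; shortfall to 180 days: ${gap:,.0f}")
```

With these made-up numbers the organization has roughly 137 days of runway and would need about $142,000 more in reserves or credit to meet the guideline.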
Relationships with Strategic Suppliers – Not all suppliers are created equal. Organizations should identify strategic suppliers that support critical business and IT services and develop close relationships with them.
Insurance – Investigate what type of insurance can protect an organization financially in the case of disaster. For example, insurance can be purchased to address cybersecurity risks and loss of key employees.
Alternate Delivery Mechanisms – The organization should consider how services and products can continue to be delivered to consumers using multiple channels. Consider the impact on service level agreements and customer expectations.
Ability to shift work – In cases where the organization will need to suspend or reduce directly billable client work, it might be possible to focus instead on internal capacity-building and training. Although this work does not bring in revenue and may even cost money, it helps to prepare the organization for better times.
Registering employees for virtual training classes cannot be overemphasized here. We all complain about never having enough time to do training; in some cases, interruption to normal business operations gives us that time. It is about improving individual skillsets and utilizing the workforce to better the entire organization.
Getting Started
As previously mentioned, the best way to get started with BCDR is to start a conversation with key leadership. To be sure, BCDR is a complex domain, and for this reason many organizations give up before they start. There is no need to do this since the conversation and basic elements of the plan can be greatly accelerated with a facilitated workshop.
It seems appropriate to end with this quotation from President John F. Kennedy:
“When written in Chinese, the word crisis is composed of two characters. One represents danger and the other represents opportunity.”
To be fair, most linguists suggest that Western speakers, including JFK, have somewhat misconstrued or overstated the original Chinese. Nevertheless, what is true is that crisis never goes away entirely. Use the current and impending disasters as an opportunity to show consumers and investors that you understand risk and are prepared.