A completed BCP (Business Continuity Plan) cycle results in a formal printed manual available for reference before, during, and after disruptions. Its purpose is to reduce adverse stakeholder impacts, which are determined by both the disruption’s scope (who and what it affects) and its duration (whether the implications last for hours, months, and so on). Measurable business impact analysis (BIA) “zones” (areas in which hazards and threats reside) include civil, economic, natural, technical, secondary and subsequent.

For the purposes of this article, the term disaster will be used to represent natural disasters, human-made disasters, and disruptions. Before January 1, 2000, governments anticipated computer failures, called the Y2K problem, in important public utility infrastructures such as banking, power, telecommunication, health and financial industries. Since 1983, regulatory agencies like the American Bankers Association and Banking Administration Institute (BAI) have required their supporting members to exercise operational continuity practices (later supported by more formal BCP manuals) that protect the public interest. Newer regulations were often based on formalized standards defined under ISO/IEC 17799 or BS 7799.

Both regulatory and global business focus on BCP arguably waned after the problem-free Y2K rollover. Some believe this lax attitude ended on September 11, 2001, when simultaneous terrorist attacks devastated downtown New York City and changed the ‘worst case scenario’ paradigm for business continuity planning.[1] BCP methodology is scalable to an organization of any size and complexity. Even though the methodology has roots in regulated industries, any type of organization may create a BCP manual, and arguably every organization should have one in order to ensure its longevity.

Disaster survival statistics suggest that firms do not invest enough time and resources in BCP preparations. Fires permanently close 44% of the businesses affected. In the 1993 World Trade Center bombing, 150 of the 350 businesses affected failed to survive the event. Conversely, the firms affected by the September 11 attacks that had well-developed and tested BCP manuals were back in business within days.

A BCP manual for a small organization may be simply a printed manual stored safely away from the primary work location, containing the names, addresses, and phone numbers for crisis management staff, general staff members, clients, and vendors along with the location of the offsite data backup storage media, copies of insurance contracts, and other critical materials necessary for organizational survival.

At its most complex, a BCP manual may outline a secondary work site, technical requirements and readiness, regulatory reporting requirements, work recovery measures, the means to reestablish physical records, the means to establish a new supply chain, or the means to establish new production centers.

Firms should ensure that their BCP manual is realistic and easy to use during a crisis. As such, BCP sits alongside crisis management and disaster recovery planning and is a part of an organization’s overall risk management. The development of a BCP manual can have five main phases:

  • Analysis
  • Solution design
  • Implementation
  • Testing and organization acceptance
  • Maintenance.

The above list is not exhaustive. A number of other considerations could be included in your own plan or manual:

  • Risk identification matrix
  • Roles and responsibilities (ensuring names are left out but titles are included, e.g. HR Manager)
  • Identification of top risks and mitigating strategies
  • Considerations for resource reallocation, e.g. a skills matrix for larger organizations

Much of the BCP material on the Internet is sponsored by consultancies that offer fee-based services for BCP solution development; however, basic tutorials are freely available on the Internet for properly motivated organizations.

The analysis phase in the development of a BCP manual consists of an impact analysis, a threat analysis, and impact scenarios, together with the resulting BCP requirement documentation.

Impact analysis (Business Impact Analysis, BIA)

An impact analysis differentiates between critical (urgent) and non-critical (non-urgent) organization functions and activities. A function may be considered critical if the implications for stakeholders of the resulting damage to the organization are regarded as unacceptable. Perceptions of the acceptability of disruption may be modified by the cost of establishing and maintaining appropriate business or technical recovery solutions. A function may also be considered critical if dictated by law. For each critical (in scope) function, two values are then assigned:

* Recovery Point Objective (RPO) – the acceptable latency of data that will be recovered, that is, how old the recovered data may be
* Recovery Time Objective (RTO) – the acceptable amount of time to restore the function

The Recovery Point Objective must ensure that the Maximum Tolerable Data Loss for each activity is not exceeded. The Recovery Time Objective must ensure that the Maximum Tolerable Period of Disruption (MTPD) for each activity is not exceeded. Next, the impact analysis results in the recovery requirements for each critical function. Recovery requirements consist of the following information:

* The business requirements for recovery of the critical function, and/or
* The technical requirements for recovery of the critical function
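
The two constraints described above (the RPO must not exceed the Maximum Tolerable Data Loss, and the RTO must not exceed the MTPD) can be checked mechanically once the values are recorded. The following is a minimal Python sketch, not a prescribed method: the `CriticalFunction` record, the `objective_violations` helper, and all example durations are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class CriticalFunction:
    """Hypothetical record of the objectives assigned to one critical function."""
    name: str
    rpo: timedelta   # Recovery Point Objective
    rto: timedelta   # Recovery Time Objective
    mtdl: timedelta  # Maximum Tolerable Data Loss
    mtpd: timedelta  # Maximum Tolerable Period of Disruption


def objective_violations(fn: CriticalFunction) -> List[str]:
    """Return a description of each violated constraint (empty if none)."""
    problems = []
    if fn.rpo > fn.mtdl:
        problems.append(f"{fn.name}: RPO exceeds Maximum Tolerable Data Loss")
    if fn.rto > fn.mtpd:
        problems.append(f"{fn.name}: RTO exceeds MTPD")
    return problems


# Illustrative values only; real figures come from the impact analysis.
payroll = CriticalFunction(
    "payroll",
    rpo=timedelta(hours=24), rto=timedelta(hours=48),
    mtdl=timedelta(hours=24), mtpd=timedelta(hours=72),
)
order_entry = CriticalFunction(
    "order entry",
    rpo=timedelta(hours=4), rto=timedelta(hours=8),
    mtdl=timedelta(hours=1), mtpd=timedelta(hours=4),
)

for function in (payroll, order_entry):
    for problem in objective_violations(function):
        print(problem)
```

In this sketch the payroll objectives are consistent, while both order entry objectives are flagged because they are looser than the tolerances they are supposed to protect.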

Threat analysis

After defining recovery requirements, documenting potential threats is recommended to detail a specific disaster’s unique recovery steps. Some common threats include the following:

* Disease
* Earthquake
* Fire
* Flood
* Online attack
* Sabotage
* Hurricane
* Utility outage
* Terrorism

All but one of the threats listed above share a common impact: potential damage to organizational infrastructure. The exception is disease. The impact of diseases can be regarded as purely human, and may be alleviated with technical and business solutions. However, if the people behind these recovery plans are also affected by the disease, then the process can break down. During the 2002–2003 SARS outbreak, some organizations grouped staff into separate teams, and rotated the teams between the primary and secondary work sites, with a rotation frequency equal to the incubation period of the disease. The organizations also banned face-to-face contact between members of different teams during business and non-business hours. With such a split, organizations increased their resiliency against the threat of government-ordered quarantine measures if one person in a team contracted or was exposed to the disease.
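
The split-team rotation described above can be laid out as a simple schedule. The Python sketch below assumes two teams that swap sites once per rotation period; the team names, the start date, and the 10-day period are placeholders, not figures taken from any actual outbreak response.

```python
from datetime import date, timedelta
from typing import List, Tuple


def rotation_schedule(
    start: date, period_days: int, periods: int
) -> List[Tuple[date, str, str]]:
    """Return (period start, team at primary site, team at secondary site).

    The two teams swap sites at the start of every period, so they are
    never assigned to the same site during the same period.
    """
    schedule = []
    for i in range(periods):
        period_start = start + timedelta(days=i * period_days)
        if i % 2 == 0:
            at_primary, at_secondary = "Team A", "Team B"
        else:
            at_primary, at_secondary = "Team B", "Team A"
        schedule.append((period_start, at_primary, at_secondary))
    return schedule


# Assumed 10-day rotation period, for illustration only.
for period_start, at_primary, at_secondary in rotation_schedule(date(2003, 3, 1), 10, 3):
    print(period_start, "primary:", at_primary, "secondary:", at_secondary)
```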

Damage from flooding also has a unique characteristic. If an office environment is flooded with non-saline, contamination-free water (e.g., in the event of a pipe burst), equipment can be thoroughly dried and may still be functional.

Definition of impact scenarios

After defining potential threats, documenting the impact scenarios that form the basis of the business recovery plan is recommended. In general, planning for the most wide-reaching disaster or disturbance is preferable to planning for a smaller scale problem, as almost all smaller scale problems are partial elements of larger disasters. A typical impact scenario like ‘Building Loss’ will most likely encompass all critical business functions, and the worst potential outcome from any potential threat.

A business continuity plan may also document additional impact scenarios if an organization has more than one building. Other more specific impact scenarios – for example a scenario for the temporary or permanent loss of a specific floor in a building – may also be documented. Organizations sometimes underestimate the space necessary to make a move from one venue to another. It is imperative that organizations consider this in the planning phase so they do not have a problem when making the move.

Solution design

The goal of the solution design phase is to identify the most cost-effective disaster recovery solution that meets the two main requirements from the impact analysis stage.

For IT applications, this is commonly expressed as:

  • The minimum application and application data requirements
  • The time frame in which the minimum application and application data must be available

Disaster recovery plans may also be required outside the IT applications domain, for example for the preservation of information in hard copy format, the management of the loss of skilled staff, or the restoration of embedded technology in process plant. This BCP phase overlaps with disaster recovery planning methodology. The solution phase determines:

  • the crisis management command structure
  • the location of a secondary work site (where necessary)
  • telecommunication architecture between primary and secondary work sites
  • data replication methodology between primary and secondary work sites
  • the application and software required at the secondary work site, and
  • the type of physical data requirements at the secondary work site.

Implementation

The implementation phase, quite simply, is the execution of the design elements identified in the solution design phase. Work package testing may take place during the implementation of the solution; however, work package testing does not take the place of organizational testing.

Testing and organizational acceptance

The purpose of testing is to achieve organizational acceptance that the business continuity solution satisfies the organization’s recovery requirements. Plans may fail to meet expectations due to insufficient or inaccurate recovery requirements, solution design flaws, or solution implementation errors. Testing may include:

  • Crisis command team call-out testing
  • Technical swing test from primary to secondary work locations
  • Technical swing test from secondary to primary work locations
  • Application test
  • Business process test

At minimum, testing is generally conducted on a biannual or annual schedule. Problems identified in the initial testing phase may be rolled up into the maintenance phase and retested during the next test cycle.

After the completion of the analysis phase, the business and technical plan requirements are documented in order to commence the implementation phase. A good asset management program can be of great assistance here and allow for quick identification of available and reallocatable resources. For an office-based, IT-intensive business, the plan requirements may cover the following elements, which may be classed as ICE (In Case of Emergency) data:

  • The numbers and types of desks, whether dedicated or shared, required outside of the primary business location in the secondary location
  • The individuals involved in the recovery effort along with their contact and technical details
  • The applications and application data required from the secondary location desks for critical business functions
  • The manual workaround solutions
  • The maximum outage allowed for the applications
  • The peripheral requirements such as printers, copiers, fax machines, calculators, paper, pens, etc.

Other business environments, such as production, distribution, and warehousing, will need to cover these elements but are likely to have additional issues to manage following a disruptive event.

Classification of Disasters

Disasters can be classified into two broad categories.

  • Natural disasters – Preventing a natural disaster is very difficult, but it is possible to take precautions to avoid losses. These disasters include floods, fires, earthquakes, hurricanes, etc.
  • Man-made disasters – These disasters are major causes of failure. Human error and intervention, whether intentional or unintentional, can cause massive failures such as loss of communication and utilities. These disasters include accidents, walkouts, sabotage, burglary, viruses, intrusions, etc.

General steps to follow while creating BCP/DRP

  • Identify the scope and boundaries of the business continuity plan. This first step defines the scope of the BCP and gives an idea of the plan’s limitations and boundaries. It also includes audit and risk analysis reports for the institution’s assets.
  • Conduct a business impact analysis (BIA). A business impact analysis is the study and assessment of the effects on the organization of the loss or degradation of business/mission functions resulting from a destructive event. Such loss may be financial, or less tangible but nevertheless essential (e.g. human resources, shareholder liaison).
  • Sell the concept of BCP to upper management and obtain organizational and financial commitment. Convincing senior management to approve the BCP/DRP is a key task. Security professionals need upper management’s approval of the plan in order to bring it into effect.
  • Ensure that each department understands its role in the plan and supports its maintenance. In case of disaster, each department has to be prepared for action. To recover and protect the critical functions, each department has to understand the plan and follow it accordingly. It is also important for each department to help in the creation and maintenance of its portion of the plan.
  • Have the BCP project team implement the plan. After approval from upper management, the plan should be implemented and maintained. The implementation team should follow the guidelines and procedures in the plan.
  • Consider the NIST tool set for BCP work. The National Institute of Standards and Technology has published tools that can help in creating a BCP.

Control measures in recovery plan

Control measures are steps or mechanisms that can reduce or eliminate computer security threats. Different types of measures can be included in a BCP/DRP.

  • Preventive measures – These controls are aimed at preventing an event from occurring.
  • Detective measures – These controls are aimed at detecting or discovering unwanted events.
  • Corrective measures – These controls are aimed at correcting or restoring the system after a disaster or event.

These controls should always be documented and tested regularly.

Strategies

Prior to selecting a disaster recovery strategy, a disaster recovery planner should refer to their organization’s business continuity plan which should indicate the key metrics of recovery point objective (RPO) and recovery time objective (RTO) for various business processes (such as the process to run payroll, generate an order, etc). The metrics specified for the business processes must then be mapped to the underlying IT systems and infrastructure that support those processes.
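
The mapping from business processes to IT systems can be sketched in a few lines of Python. This example assumes, as one reasonable convention rather than a stated rule, that a system shared by several processes inherits the strictest (smallest) RPO and RTO among them. The process names, system names, and durations are all hypothetical.

```python
from datetime import timedelta
from typing import Dict, List, Tuple

Objectives = Tuple[timedelta, timedelta]  # (RPO, RTO)


def map_to_systems(
    process_objectives: Dict[str, Objectives],
    dependencies: Dict[str, List[str]],
) -> Dict[str, Objectives]:
    """Derive per-system objectives from the processes each system supports."""
    system_objectives: Dict[str, Objectives] = {}
    for process, systems in dependencies.items():
        rpo, rto = process_objectives[process]
        for system in systems:
            if system in system_objectives:
                current_rpo, current_rto = system_objectives[system]
                # Assumption: a shared system takes the strictest objective.
                system_objectives[system] = (min(current_rpo, rpo), min(current_rto, rto))
            else:
                system_objectives[system] = (rpo, rto)
    return system_objectives


# Illustrative inputs only.
process_objectives = {
    "run payroll": (timedelta(hours=24), timedelta(hours=72)),
    "generate an order": (timedelta(hours=1), timedelta(hours=4)),
}
dependencies = {
    "run payroll": ["hr_database", "file_server"],
    "generate an order": ["order_database", "file_server"],
}

for system, (rpo, rto) in map_to_systems(process_objectives, dependencies).items():
    print(f"{system}: RPO {rpo}, RTO {rto}")
```

Here the shared file server ends up with the one-hour RPO and four-hour RTO of order generation, because that process is the more demanding of the two that depend on it.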

Once the RTO and RPO metrics have been mapped to IT infrastructure, the DR planner can determine the most suitable recovery strategy for each system. It is important to note, however, that the business ultimately sets the IT budget, and therefore the RTO and RPO metrics need to fit within the available budget. While most business unit heads would like zero data loss and zero time loss, the cost associated with that level of protection may make the desired high availability solutions impractical.

The following is a list of the most common strategies for data protection.

  • Backups made to tape and sent off-site at regular intervals (preferably daily)
  • Backups made to disk on-site and automatically copied to off-site disk, or made directly to off-site disk
  • Replication of data to an off-site location, which overcomes the need to restore the data (only the systems then need to be restored or synced). This generally makes use of storage area network (SAN) technology
  • High availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data
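
The trade-off between recovery objectives and budget across the strategies listed above can be illustrated with a small selection routine. The Python sketch below picks the cheapest strategy that meets a system's RPO and RTO within a budget. The achievable RPO and RTO figures and the costs attached to each strategy are invented placeholders, not measured or typical values; real figures depend on the vendor, data volume, and infrastructure.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rpo: timedelta
    achievable_rto: timedelta
    annual_cost: int  # hypothetical currency units


# Placeholder figures for illustration only.
STRATEGIES: List[Strategy] = [
    Strategy("tape backup sent off-site", timedelta(hours=24), timedelta(hours=72), 10_000),
    Strategy("disk backup copied off-site", timedelta(hours=12), timedelta(hours=24), 25_000),
    Strategy("off-site data replication", timedelta(minutes=15), timedelta(hours=4), 80_000),
    Strategy("high availability system", timedelta(0), timedelta(0), 250_000),
]


def cheapest_strategy(
    rpo: timedelta,
    rto: timedelta,
    budget: int,
    strategies: List[Strategy] = STRATEGIES,
) -> Optional[Strategy]:
    """Return the cheapest strategy meeting both objectives within budget, or None."""
    candidates = [
        s for s in strategies
        if s.achievable_rpo <= rpo
        and s.achievable_rto <= rto
        and s.annual_cost <= budget
    ]
    return min(candidates, key=lambda s: s.annual_cost) if candidates else None


choice = cheapest_strategy(timedelta(hours=1), timedelta(hours=8), budget=100_000)
print(choice.name if choice else "no strategy fits; revisit the objectives or the budget")
```

A result of `None` corresponds to the situation where the requested objectives cannot be met within the available budget, so either the RPO and RTO or the budget has to be renegotiated.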

In many cases, an organization may elect to use an outsourced disaster recovery provider to provide a stand-by site and systems rather than using their own remote facilities.

In addition to preparing for the need to recover systems, organizations must also implement precautionary measures with an objective of preventing a disaster in the first place. These may include some of the following:

  • Local mirrors of systems and/or data and use of disk protection technology such as RAID
  • Surge protectors – to minimize the effect of power surges on delicate electronic equipment
  • Uninterruptible power supply (UPS) and/or backup generator to keep systems going in the event of a power failure
  • Fire prevention – alarms, fire extinguishers
  • Anti-virus software and other security measures