Disaster Recovery Management
Deploying a Disaster Recovery (DR) site and replicating mission-critical data, applications, and servers between the primary site and the DR site is critical in today’s business environment. Managing that infrastructure, however, is even more critical.
With the evolution of ERP, specialized line-of-business applications, and core banking solutions for financial institutions, businesses must ensure that data is available 24×7, 365 days a year. The data must also be backed up regularly and be available on demand. More important still, financial transactions lost because a server is unavailable can cause huge monetary loss to the business.
These pressures have led businesses to deploy secondary datacenters, also referred to as Disaster Recovery sites, to which data from the primary site is replicated continuously, so that if a disaster shuts down the primary datacenter, the DR site can be made operational as soon as possible.
Disaster recovery is one part of business continuity practice.
How Disaster Recovery Management Works
Having a DR site is not enough: the DR site must actually work if a disaster strikes the primary site. To ensure this, regular DR drills, audits, and similar checks have to be performed.
Performing such drills requires network engineers, IT security experts, database administrators, application experts, server experts, and storage experts to be present at the primary or DR site, to stay in sync with each other, and to do their jobs with precision, because a mistake by any one person could derail the entire exercise and cause significant unplanned downtime.
In datacenter environments that involve DR sites, it is important to know the RPO (Recovery Point Objective) and the RTO (Recovery Time Objective). RPO describes how much data, expressed as a window of time, would be lost if a disaster struck the primary site.
RTO describes how long it takes, after a disaster has struck, to restore normal operations and make systems available to users again.
Generally, the lower the RPO and RTO, the more expensive the solution.
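The two definitions above reduce to simple time arithmetic. The following is a minimal sketch, not a prescribed method: the function names and the timestamps are hypothetical, chosen only to show how the data-loss window and the recovery time observed in a drill could be computed.

```python
from datetime import datetime, timedelta

def achieved_rpo(last_replicated_at, disaster_at):
    """Data-loss window: time between the last replicated write and the disaster."""
    return disaster_at - last_replicated_at

def achieved_rto(disaster_at, service_restored_at):
    """Recovery time: time between the disaster and systems being available again."""
    return service_restored_at - disaster_at

# Hypothetical timestamps from a DR drill (illustrative values only).
last_sync = datetime(2024, 1, 10, 13, 52, 0)
disaster = datetime(2024, 1, 10, 14, 0, 0)
restored = datetime(2024, 1, 10, 18, 30, 0)

rpo = achieved_rpo(last_sync, disaster)
rto = achieved_rto(disaster, restored)
print("Data-loss window:", rpo)  # 0:08:00
print("Recovery time:   ", rto)  # 4:30:00
```

In this illustration, eight minutes of data would be lost and operations would resume four and a half hours after the disaster.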
One can achieve zero RPO by deploying a near-line DR site, where data remains synchronized with the primary site at all times. However, the near DR site has to be within 200 km of the primary datacenter.
If the DR site is more than 200 km away, the RPO could be under 15 minutes, which is an accepted standard, or it could be more. The higher the RPO, the greater the loss of data.
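A likely reason for a distance limit of this kind is that synchronous replication waits for the remote site to acknowledge each write, so every write pays at least one network round trip. The sketch below estimates that minimum delay. It assumes light travels through optical fibre at roughly 200,000 km/s and ignores switching and storage overhead, so real figures will be higher; the function name is hypothetical.

```python
# Assumption: signal speed in optical fibre is roughly 200,000 km/s
# (about two thirds of the speed of light in a vacuum).
FIBRE_SPEED_KM_PER_S = 200_000

def round_trip_ms(distance_km):
    """Minimum round-trip propagation delay over fibre, in milliseconds."""
    return 2 * distance_km * 1000 / FIBRE_SPEED_KM_PER_S

for km in (50, 200, 1000):
    print(km, "km ->", round_trip_ms(km), "ms per synchronous write (lower bound)")
```

Under these assumptions, a site 200 km away adds at least 2 ms to every write, and a site 1000 km away adds at least 10 ms, which is why longer distances usually rely on asynchronous replication and accept a non-zero RPO.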
The challenge is that even when the required systems are in place for a DC-DR scenario, how does one ensure that the RPO will be under 15 minutes? How does one ensure that the RTO will be under 6 hours, or whatever time was agreed in the SLA (Service Level Agreement)?
Even with a DR site in place, is IT sure that the above would hold true?
Is IT sure that patches for applications, operating systems, and databases are at the same level at the DR site as at the DC site?
Is IT sure that the current passwords are in use at the DR site?
Is IT sure that the DR drill book is updated with the latest procedures?
How much time will it take for the data to be replicated to the DR site before failover can be performed?
How prepared is IT to handle application recovery if the experts are not available?
Such questions keep coming.
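Some of the questions above can be checked automatically rather than answered from memory. The following is a minimal sketch of two such checks: comparing patch levels between the two sites, and estimating how long the pending replication backlog would take to reach the DR site. The function names, component versions, backlog size, and link speed are all hypothetical, and the estimate assumes the replication link runs at its full rated speed.

```python
from datetime import timedelta

def patch_drift(primary, dr):
    """Return components whose version at the DR site differs from the primary site."""
    return {
        name: (version, dr.get(name))
        for name, version in primary.items()
        if dr.get(name) != version
    }

def backlog_drain_time(backlog_gb, link_mbps):
    """Time to ship the pending replication backlog over the link."""
    megabits = backlog_gb * 8 * 1000  # decimal gigabytes -> megabits
    return timedelta(seconds=megabits / link_mbps)

# Hypothetical inventory of patch levels at each site.
primary = {"os": "5.14.0-362", "database": "19.21", "app": "4.2.1"}
dr = {"os": "5.14.0-362", "database": "19.19", "app": "4.2.1"}

drift = patch_drift(primary, dr)
drain = backlog_drain_time(backlog_gb=9, link_mbps=100)

print("Out-of-sync components:", drift)
print("Backlog drain time before failover:", drain)
```

Here the database at the DR site is behind the primary site, and the 9 GB backlog would need 12 minutes on a 100 Mbps link before failover could proceed with current data.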
As they say, ‘You can’t manage what you can’t measure.’ However, measurement of a company’s recovery solution’s ability to perform has been challenging, at best. Testing is a painful, involved, disruptive, and costly process that requires significant coordination between groups and is also labour-intensive. As a result, most companies have a tendency to test their recovery solution less than they would like to.
Today’s IT operations are complicated and dynamic. Every day, things change in a business environment that could potentially cause restarting operations at a remote site to fail. The worst time for a company to find that out is when they need their recovery solution to perform as designed to help them get their business back up and running after an outage or a disaster incident.