Tuesday 28 January 2014

Thinking about Active Directory Recovery

Active Directory is probably the most heavily used and least considered software component in your organisation. Microsoft Active Directory underpins almost every authentication activity:
  • Workstation login
  • Printer access
  • Email access
  • Federation to external resources
  • File access
  • Delegated access to resources
  • SharePoint
  • Office Communications Server/Lync
  • SQL Server
  • IIS websites

When Active Directory fails, the fallout will be enormous. Most likely it is not currently in scope for your Business Continuity Plans for major application failure, or if it is, it is poorly considered.

In my experience with customers, even when it is in scope, the Active Directory recovery plan amounts to restoring from tape and following the Microsoft recovery guide here:  

Unfortunately, the steps contained in that document are not a recovery process at all, but rather a set of steps that you will need to undertake when the problem occurs.  

Recovery is also not a simple backup and restore of system state when there are multiple DCs, and it is worse still when there are constraints on expertise and/or the WAN.

Two types of constraint come to mind when a significant Active Directory issue occurs that might require a full recovery of AD.

1. Political/People/Management
a.       A ‘war room’ committee will need to be convened, and planning for the recovery process will commence
b.      People need to be mobilised
c.       The right skills need to be available to perform the recovery, as it’s a complex task
d.      The recovery process needs to be current and valid
e.      Every 60 minutes management will want an update on progress
2. Technical
a.       If running multiple domain controllers, each domain controller needs to be isolated from all the others to ensure bad data doesn’t replicate.
b.      Recovery may require multiple backup versions to ensure the restore doesn’t bring back a previous ‘bad’ backup.
c.       AD health needs to be checked and confirmed to ensure all services are back up and operational.
d.       The recovery process might have to pause the recovery of various servers to ensure the correct restore process is followed.
e.      The RID needs to be rolled forward to ensure that old corrupt data doesn’t become authoritative and overwrite good recovered data.
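
As a rough illustration of point 2b, the sketch below picks a restore candidate from a set of backup versions: the newest backup taken before the problem was first seen, and still young enough to be usable. The Backup class, the pick_restore_candidate function, the sample dates and the 180-day age limit are all assumptions made for this example. The limit stands in for your forest's tombstone lifetime, which you should check rather than assume. A backup taken before the problem was noticed can still contain it, so the chosen candidate needs to be validated in isolation before it is trusted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Backup:
    """Hypothetical record of one system state backup."""
    label: str
    taken_at: datetime


def pick_restore_candidate(
    backups: Iterable[Backup],
    problem_seen_at: datetime,
    restore_at: datetime,
    max_age: timedelta = timedelta(days=180),  # assumed limit, check your forest
) -> Optional[Backup]:
    """Return the newest backup taken before the problem was first seen
    and no older than max_age at restore time, or None if there is none."""
    oldest_allowed = restore_at - max_age
    candidates = [
        b for b in backups
        if oldest_allowed <= b.taken_at < problem_seen_at
    ]
    # max() with default=None handles the "no usable backup" case.
    return max(candidates, key=lambda b: b.taken_at, default=None)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    backups = [
        Backup("monday", datetime(2014, 1, 20, 2, 0)),
        Backup("saturday", datetime(2014, 1, 25, 2, 0)),
        Backup("following-monday", datetime(2014, 1, 27, 2, 0)),
    ]
    chosen = pick_restore_candidate(
        backups,
        problem_seen_at=datetime(2014, 1, 26, 12, 0),
        restore_at=datetime(2014, 1, 28, 9, 0),
    )
    print(chosen.label if chosen else "no usable backup")
```

If the function returns None, there is no backup that is both older than the problem and within the age limit, which is exactly the situation you want to discover in a test rather than during an outage.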

My experience with business disasters has been that the larger a problem becomes, the more people get involved and the slower the recovery of the failed system goes. Without a good rollback position, people are reluctant to attempt the recovery without more time and still more people becoming involved. This becomes a nightmare of epic proportions.


Recently we were invited to demonstrate a recovery of Active Directory against Microsoft Professional Services for a customer of ours, to highlight the difference in TTTR (Total Time To Recover).

Microsoft PSO and their recovery process required 17 hours to restore AD.

Our software approach took 1 hour and 5 minutes, and we proved this three times.

In addition to performing the recovery, our software creates the recovery process and automates it. It also allows the business to test full AD recovery without risk.

Whether your organisation needs to be able to recover quickly is a decision for the business leaders, but in many cases the business doesn’t understand the implications of a full forest outage, or just how much of the business may be affected and inoperable.

