Disaster Recovery and Business Continuity (DRBC) is traditionally a complete reproduction of a production environment in a different geographical location, kept so that business can continue in the event of a disaster. This used to mean preparing the hardware and applications in advance, at great cost. With cloud-based architectures and DevOps tools for provisioning automation, we can instead build tooling that provisions a production environment dynamically, reducing the overall cost of keeping a full DRBC plan in place.
If you haven’t documented your disaster recovery plan in detail before taking your environment live, then when the need arises you’ll find that recovering reactively is a difficult and complicated process. Think of your disaster recovery plan as the total timeline to bring your services back online with no dependencies. If you’ve already built your production environment, ask yourself “How quickly can I do this again?” and “Is there anything I can do to speed up the process?”
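One way to make that timeline concrete is to write the build steps down with a time estimate for each, add them up, and see which manual steps cost the most. The following is a minimal sketch of that idea; the step names, durations, and the `Step` structure are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    minutes: float   # estimated time to complete the step
    automated: bool  # True once the step has been scripted


def total_minutes(steps):
    """Total recovery timeline, assuming the steps run one after another."""
    return sum(step.minutes for step in steps)


def manual_steps_by_cost(steps):
    """Manual steps, slowest first: the likeliest candidates for automation."""
    manual = [s for s in steps if not s.automated]
    return sorted(manual, key=lambda s: s.minutes, reverse=True)


# Hypothetical steps and estimates, for illustration only.
plan = [
    Step("Provision network", 30, automated=True),
    Step("Build web servers", 120, automated=False),
    Step("Restore database", 90, automated=False),
    Step("Update DNS records", 15, automated=False),
]

print(f"Total timeline: {total_minutes(plan)} minutes")
for step in manual_steps_by_cost(plan):
    print(f"  manual: {step.name} ({step.minutes} min)")
```

Re-running the total after each step is automated gives a rough answer to “how quickly can I do this again?”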
I typically document the manual steps for building out the entire project. The result is a linear list of steps that we can then seek to automate, reducing both our build time and our susceptibility to error. For this example website application, I address the following requirements for each environment:
Regions
What hosting regions are in use for this site? Is this site a multi-region architecture?
Domains and sub-domains in use
What are the domains assigned by default?
What are the custom domains assigned to this site?
What are the custom sub-domains assigned to micro-sites?
SSL Certificates required
Is HTTPS in use? Is it required?
Is an SSL certificate installed? If yes, for which domains?
Number of public IPs needed
How many public IPs are needed for this architecture?
vnet size
How large a network do we need to create? App services do not require vnets.
subnet_X size
How large must each subnet be?
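The sizing questions above come down to simple arithmetic on address blocks, which Python's standard `ipaddress` module can do. This is only a sketch: the `10.0.0.0/16` range and the host counts are hypothetical, and `reserved` is a parameter because cloud providers hold back some addresses in every subnet (check your provider's documentation for the exact number).

```python
import ipaddress

# Hypothetical address space; substitute the range your environment uses.
vnet = ipaddress.ip_network("10.0.0.0/16")


def smallest_prefix_for(hosts_needed, reserved=0):
    """Longest IPv4 prefix whose block holds hosts_needed plus reserved addresses."""
    total = hosts_needed + reserved
    prefix = 32
    while 2 ** (32 - prefix) < total:
        prefix -= 1
    return prefix


def carve_subnets(network, prefix, count):
    """Return the first `count` subnets of the given prefix length."""
    subnets = network.subnets(new_prefix=prefix)
    return [next(subnets) for _ in range(count)]


# Example: 50 hosts per subnet, assuming the platform reserves 5 addresses.
prefix = smallest_prefix_for(50, reserved=5)
for subnet in carve_subnets(vnet, prefix, 3):
    print(subnet, subnet.num_addresses, "addresses")
```

Recording the resulting ranges alongside the plan means the network can be recreated with the same layout during a rebuild.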
DNS Transaction
Query custom domains for resolution
TTL: the TTL affects total query volume and how quickly records can be updated. Typical values are 1 day or 1 hour. 5 minutes is useful for migrations, but left that way it will consume the query allowance.
Total queries: this figure may not be available, but we can try to identify the DNS provider and compare against its known defaults.
Response time: a slow DNS response can indicate that lookups are being throttled or that the provider has reached its concurrent capacity.
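The response-time check can be roughed out with the standard library by timing repeated lookups. This is a sketch under a few assumptions: it measures whatever resolver the operating system is configured to use, the `resolver` parameter exists only so the function can be exercised without a network, and the standard library does not expose the record TTL (a tool such as `dig` shows that).

```python
import socket
import statistics
import time


def time_lookups(hostname, attempts=5, resolver=socket.getaddrinfo):
    """Resolve hostname several times; return each lookup's duration in ms."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        resolver(hostname, None)
        durations.append((time.perf_counter() - start) * 1000)
    return durations


def summarize(durations):
    """Median and worst-case lookup time in milliseconds."""
    return {
        "median_ms": statistics.median(durations),
        "max_ms": max(durations),
    }
```

Calling `summarize(time_lookups("www.example.com"))` with one of the site's custom domains in place of the placeholder gives a baseline to compare against later. Local caches often answer repeated lookups, so the first attempt is usually the most telling.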
Endpoints (firewall/loadbalancer/ipwhitelisting)
Identify exposed services: HTTP, HTTPS, FTP, Web Deploy, SSH
Firewall Rules
Identify any open ports. Common services use 80, 443, 21, 8172, and 22.
Load Balancing Rules
If we have multiple nodes, is the load being distributed? Can we identify the load balancing logic?
IP restrictions / whitelisting
Do IP-restricted zones actually reject requests from addresses outside the whitelist?
SSL decoding
Is the SSL certificate installed on the app services or on the load balancer?
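The exposed-service and open-port checks above can be sketched as a simple TCP connection test. This is a minimal example, not a replacement for a proper scanner: the service-to-port mapping assumes default ports (with Web Deploy on 8172, its usual default), the `check` parameter is there only to make the function easy to exercise, and it should only be run against hosts you are authorized to test.

```python
import socket

# Default ports for the services listed above (assumed, not discovered).
COMMON_PORTS = {"HTTP": 80, "HTTPS": 443, "FTP": 21, "Web Deploy": 8172, "SSH": 22}


def is_port_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def scan(host, ports=COMMON_PORTS, check=is_port_open):
    """Map each service name to whether its port accepted a connection."""
    return {name: check(host, port) for name, port in ports.items()}
```

Running `scan` from both a whitelisted and a non-whitelisted address is one way to see whether the IP restrictions respond as expected.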
Network
What are the application IPs, database IPs, and other service IPs?
vnets or NSGs (IP ranges)
subnets
Security groups
Verify that NSGs or ACLs reflect the firewall rules and IP restrictions identified.
Application
Check for known application pages
Config files, connection strings
IIS / web server
(versions, features, permissions)
.NET / stack
(versions, app pools, permissions)
Deployment Dependencies
(webdeploy / octopus agent / sftp)
Service Dependencies
Database Server
(versions, ports, users, permissions, transaction read/write times)
Search Server
(IP, index strategy)
NoSQL or other application database servers
CDN
(cache control headers, origin urls, cache clearing)
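For the cache-control item, it can help to record what the origin actually tells the CDN. Below is a simplified sketch of reading a `Cache-Control` header value. It assumes the header string has already been captured from a response, splits naively on commas (so quoted values containing commas are not handled), and treats `no-store` and `no-cache` as a lifetime of zero, which is a simplification.

```python
def parse_cache_control(header_value):
    """Split a Cache-Control header into a {directive: value-or-None} dict."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def cache_lifetime_seconds(header_value):
    """Seconds a shared cache such as a CDN may keep the response, or None."""
    directives = parse_cache_control(header_value)
    if "no-store" in directives or "no-cache" in directives:
        return 0
    # s-maxage applies to shared caches and takes precedence over max-age.
    for key in ("s-maxage", "max-age"):
        value = directives.get(key)
        if value is not None and value.isdigit():
            return int(value)
    return None


print(cache_lifetime_seconds("public, max-age=3600, s-maxage=600"))
```

Knowing the cache lifetime tells you how long stale content may keep being served after a failover unless the cache is cleared.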
OS
Patch level / updates
version
installed features
users
Deployment Users
Hypervisor
Is there a hypervisor layer installed on the hardware? Is this a cloud-based hypervisor?
Performance variation
What is the average performance, and what are the maximum acceptable outliers?
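A small sketch of how that question might be answered from a set of measurements. The sample values and the 500 ms threshold are hypothetical, and where the samples come from (load test, access logs, monitoring) is left open.

```python
import statistics


def summarize_response_times(samples_ms, max_acceptable_ms):
    """Average, worst case, and the samples that exceed the acceptable maximum."""
    return {
        "mean_ms": statistics.mean(samples_ms),
        "max_ms": max(samples_ms),
        "outliers": [s for s in samples_ms if s > max_acceptable_ms],
    }


# Hypothetical response times in milliseconds.
samples = [120, 135, 110, 980, 125]
print(summarize_response_times(samples, max_acceptable_ms=500))
```

Capturing the same summary for the rebuilt environment makes it possible to confirm that the recovery site performs comparably to production.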
HA / SLA / clustering
What is the SLA based on the current configuration?
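As a rough guide, the SLA of a configuration can be estimated from the SLAs of its parts. The sketch below assumes component failures are independent, which real systems only approximate, and the availability figures are hypothetical rather than any provider's published numbers.

```python
def serial_availability(*components):
    """Availability when every component must be up for the site to work."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(availability, nodes):
    """Availability when any one of `nodes` identical, independent nodes suffices."""
    return 1.0 - (1.0 - availability) ** nodes


def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


# Hypothetical figures: two web nodes behind a load balancer, plus one database.
web_tier = redundant_availability(0.9995, nodes=2)
site = serial_availability(web_tier, 0.9999)
print(f"Estimated availability: {site:.6f}")
print(f"Estimated downtime: {downtime_hours_per_year(site):.2f} hours/year")
```

The serial case shows why a chain of dependencies always has a lower combined SLA than its weakest link, and the redundant case shows what clustering buys back.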
Billing
Fixed cost (provisioned) or incremental range (autoscaling)?
Hardware
Server, LB, firewall, or cloud service instance sizing
Dedicated solutions
Datacenter details
Cloud Generations
(AWS sizing / Azure Series Letter) Can we identify the sizing for the instances?
Disaster Recovery
Azure ARM or AWS CloudFormation templates for provisioning allow rapid environment rebuilds.
Webserver scaling
Recommended scaling method
Database Mirroring or replication (service recovery)
Mirror or PaaS replication?
Database offsite log shipping (data recovery)
Database offsite log shipping, geo-replication, backup exports