A Practical Architecture Index for DRBC by Juan Garza
Disaster Recovery and Business Continuity (DRBC) has traditionally meant maintaining a complete reproduction of a production environment in a different geographical location so that business can continue in the event of a disaster. That used to mean preparing the hardware and applications in advance, at great cost. With cloud-based architectures and DevOps tools for provisioning automation, we can instead build the tooling to provision a production environment dynamically, reducing the overall cost of keeping a full DRBC plan in place.
If you haven’t documented your disaster recovery plan in detail before taking your environment live, then when the need arises you’ll find that a reactive plan is difficult and complicated to carry out. Think of your disaster recovery plan as the total timeline to bring your services back online with no dependencies. If you’ve already built your production environment, ask yourself “How quickly can I do this again?” and “Is there anything I can do to speed up the process?”
I typically detail the manual steps for building out the entire project for documentation purposes. The result is a linear list of steps that we can then seek to automate, reducing both our build time and our susceptibility to error. For this example website application, I like to address the following requirements for each environment:
What hosting regions are in use for this site? Is this site a multi-region architecture?
Domains and sub-domains in use
What are the domains assigned by default?
What are the custom domains assigned to this site?
What are the custom sub-domains assigned to micro-sites?
SSL Certificates required
Is HTTPS in use? Is it required?
Is an SSL certificate installed? If yes, for which domains?
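The question of which domains a certificate covers can be answered mechanically. The following is a minimal sketch using only the Python standard library. The helper names (`names_in_certificate`, `covers`, `uncovered_domains`, `fetch_certificate`) are hypothetical, and the wildcard handling is a simplification: it only understands a single leading `*.` label.

```python
import socket
import ssl


def names_in_certificate(cert: dict) -> list[str]:
    """Return the DNS names from a dict shaped like ssl.SSLSocket.getpeercert()."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def covers(cert_name: str, hostname: str) -> bool:
    """True if cert_name matches hostname (simplified wildcard handling)."""
    cert_name, hostname = cert_name.lower(), hostname.lower()
    if cert_name.startswith("*."):
        # A wildcard stands for exactly one label: *.example.com matches
        # www.example.com, but not example.com or a.b.example.com.
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == cert_name[2:]
    return cert_name == hostname


def uncovered_domains(cert: dict, domains: list[str]) -> list[str]:
    """Return the documented domains that no name in the certificate covers."""
    names = names_in_certificate(cert)
    return [d for d in domains if not any(covers(n, d) for n in names)]


def fetch_certificate(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Fetch the validated peer certificate from a live endpoint (needs network)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()
```

In practice you would pass the result of `fetch_certificate("your-site.example")` together with the domain list gathered earlier in the checklist to `uncovered_domains`; any name that comes back needs a certificate before the rebuilt environment can serve it over HTTPS.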
Number of public ips needed
How many public IPs are needed for this architecture?
How large a network do we need to create? App Services do not require VNets.
How large must the network be?
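Network sizing is simple arithmetic once you know how many addresses you need. The sketch below uses Python's `ipaddress` module; the function names are hypothetical, and the default of five reserved addresses per subnet is an assumption to check against your provider's documentation.

```python
import ipaddress


def smallest_prefix(hosts_needed: int, reserved_per_subnet: int = 5) -> int:
    """Return the longest IPv4 prefix whose subnet still fits the hosts.

    reserved_per_subnet is an assumption: addresses the platform keeps
    for itself in every subnet. Confirm the value for your provider.
    """
    if hosts_needed < 1:
        raise ValueError("hosts_needed must be at least 1")
    total = hosts_needed + reserved_per_subnet
    # (total - 1).bit_length() is ceil(log2(total)) without float rounding.
    host_bits = max(1, (total - 1).bit_length())
    if host_bits > 32:
        raise ValueError("too many hosts for a single IPv4 network")
    return 32 - host_bits


def plan_subnet(base: str, hosts_needed: int, reserved_per_subnet: int = 5):
    """Return the smallest network starting at base that fits the hosts."""
    prefix = smallest_prefix(hosts_needed, reserved_per_subnet)
    return ipaddress.ip_network(f"{base}/{prefix}", strict=False)


if __name__ == "__main__":
    net = plan_subnet("10.0.0.0", 10)
    print(net, "holds", net.num_addresses, "addresses")
```

Recording the resulting prefix in the DRBC document means the rebuilt network can be declared in one step rather than resized after the first deployment fails for lack of addresses.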
Query custom domains for resolution
TTL: timing affects total queries and the ability to update. Typical values are 1 day or 1 hour; 5 minutes is useful for migrations, but left this way it will consume all available queries.
Total queries: the total may not be available, but we can try to identify the DNS provider and compare against known defaults.
Response time: DNS response time can indicate a throttled DNS service or one at maximum concurrent capacity.
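The TTL trade-off above can be made concrete with a little arithmetic: a caching resolver asks for a record at most about once per TTL, so a shorter TTL multiplies the query count. The sketch below is a rough upper bound, not a model of real traffic. `resolvers` is a hypothetical count of distinct caching resolvers asking for the record, and `resolution_time_ms` times the local resolver path, which may be served from a cache.

```python
import math
import socket
import time

SECONDS_PER_DAY = 86_400


def max_queries_per_day(ttl_seconds: int, resolvers: int = 1) -> int:
    """Upper bound on daily queries for one record.

    Assumes each caching resolver re-queries at most once per TTL.
    """
    if ttl_seconds <= 0 or resolvers < 0:
        raise ValueError("ttl_seconds must be positive and resolvers non-negative")
    return math.ceil(SECONDS_PER_DAY / ttl_seconds) * resolvers


def resolution_time_ms(hostname: str) -> float:
    """Time one lookup through the local resolver, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    for label, ttl in (("1 day", 86_400), ("1 hour", 3_600), ("5 minutes", 300)):
        print(label, max_queries_per_day(ttl, resolvers=1000))
```

Moving from a 1 hour TTL to a 5 minute TTL multiplies the bound by 12, which is why a migration TTL should be raised again once the cutover is complete.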
Identify exposed services: HTTP, HTTPS, FTP, Web Deploy, SSH
Identify any open ports: common service ports are 80, 443, 21, 8172, 22
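A quick way to confirm which of these services are actually exposed is a plain TCP connect test. This is a minimal sketch, assuming IPv4 and that you are authorized to probe the host. `COMMON_PORTS` and `open_ports` are hypothetical names, and the map uses 8172, the default port for Web Deploy through the IIS Web Management Service.

```python
import socket

# Service labels for the ports named in the checklist.
COMMON_PORTS = {80: "HTTP", 443: "HTTPS", 21: "FTP", 8172: "Web Deploy", 22: "SSH"}


def open_ports(host: str, ports, timeout: float = 1.0) -> list[int]:
    """Return the ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


if __name__ == "__main__":
    for port in open_ports("127.0.0.1", COMMON_PORTS):
        print(port, COMMON_PORTS[port])
```

A port that accepts connections only tells you something is listening; it does not identify the service, so treat the labels as a starting point for the inventory.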
If we have multiple nodes, is the load being distributed? Can we identify the load-balancing logic?
IP restrictions / whitelisting
Do IP-restricted zones actually respond as restricted?
Is the SSL certificate installed on the app services or on the load balancer?
What are the application IPs, database IPs, and other service IPs?
VNets or NSGs (IP ranges)
Verify that the NSGs or ACLs reflect the firewall rules and IP restrictions identified.
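The verification step above amounts to a set comparison: every range an NSG or ACL allows should fall inside a range the documentation says is allowed. The sketch below assumes you have already exported both lists as CIDR strings; `uncovered_rules` is a hypothetical helper, and the sample ranges come from private and documentation address space.

```python
import ipaddress


def uncovered_rules(rule_ranges, documented_ranges) -> list[str]:
    """Return allow-rule ranges that no documented range fully contains."""
    documented = [ipaddress.ip_network(r) for r in documented_ranges]
    unexpected = []
    for rule in rule_ranges:
        net = ipaddress.ip_network(rule)
        # subnet_of raises TypeError across IP versions, so compare versions first.
        if not any(net.version == d.version and net.subnet_of(d) for d in documented):
            unexpected.append(rule)
    return unexpected


if __name__ == "__main__":
    rules = ["10.0.1.0/24", "203.0.113.7/32", "0.0.0.0/0"]
    documented = ["10.0.0.0/16", "203.0.113.0/24"]
    print(uncovered_rules(rules, documented))
```

Any range this returns is either an undocumented restriction that belongs in the DRBC plan or a rule that is broader than intended.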
Check for known application pages
config files, connection strings,
(versions, features, permissions)
(versions, app pools, permissions)
(Web Deploy / Octopus agent / SFTP)
(versions, ports, users, permissions, transaction read/write times)
IP, index strategy
NoSQL or other application database servers
(cache control headers, origin urls, cache clearing)
Patch level / updates
Is there a hypervisor layer installed on the hardware? Is this a cloud-based hypervisor?
What is the average performance, and what are the maximum acceptable outliers?
HA / SLA / clustering
What is the SLA based on the current configuration?
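A rough composite figure for the current configuration can be estimated from the availability of each component. The sketch below assumes component failures are independent, which real outages often are not, so treat the result as an optimistic estimate. The function names and the sample percentages are hypothetical and are not any provider's published SLA.

```python
import math


def serial(*availabilities: float) -> float:
    """Availability of components that must all be up (a chain)."""
    return math.prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Availability of identical copies where one working copy is enough."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Expected downtime implied by an availability figure."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    # Hypothetical figures: two web nodes behind a load balancer, one database.
    web_tier = redundant(0.999, copies=2)
    total = serial(0.9999, web_tier, 0.9995)
    print(f"{total:.4%} composite,", f"{downtime_minutes_per_month(total):.1f} min/month")
```

The chain is only as strong as its weakest single component, so a clustered web tier does little for the overall figure if the database behind it is a single instance.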
Fixed-cost provisioned or incremental range (autoscaling)
Server, LB, firewall, or cloud service instance sizing
(AWS sizing / Azure series letter) Can we identify the sizing for the instances?
Azure ARM or AWS CloudFormation templates for provisioning allow rapid environment rebuilds
Recommended scaling method
Database Mirroring or replication (service recovery)
Mirror or PaaS replication?
Database offsite log shipping (data recovery)
Database offsite log shipping, geo-replication, backup exports