AWS Disaster Recovery
There are four Disaster Recovery scenarios that highlight usage of AWS and compare AWS with traditional DR methods:
- Backup and Restore
- Pilot Light for Quick Recovery into AWS
- Warm Standby Solution
- Multi-site Solution
Backup and Restore
In most traditional environments, data is backed up to tape and sent off-site regularly. Of the scenarios described here, backup and restore has the longest recovery time. Amazon S3 is an ideal destination for backup data, as it is designed to provide 99.999999999% (11 9s) durability of objects over a given year. Data is typically transferred to and from Amazon S3 over the network, so backups are accessible from any location. Many commercial and open source backup solutions back up to Amazon S3. The AWS Snowball service enables transfers of very large data sets by shipping storage devices directly to AWS.
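To make the durability figure concrete, the following minimal sketch works through the arithmetic. The object count of 10,000,000 is an assumed example value, and the result describes the designed durability only, not a guarantee. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Designed annual durability quoted in the text: 99.999999999% (11 nines).
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)

def expected_annual_loss(object_count: int) -> Fraction:
    """Expected number of objects lost per year at the designed durability."""
    return object_count * (1 - DURABILITY)

def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(object_count)

# Assumed example: 10,000,000 stored objects.
print(float(expected_annual_loss(10_000_000)))  # 0.0001
print(years_per_lost_object(10_000_000))        # 10000
```

In other words, at this design target and with ten million stored objects, you would expect to lose a single object about once every 10,000 years on average.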
The AWS Storage Gateway service enables snapshots of your on-premises data volumes to be transparently copied into Amazon S3 for backup. You can subsequently create local volumes or Amazon EBS volumes from these snapshots.
For systems running on AWS, customers also back up into Amazon S3. Snapshots of Elastic Block Store (EBS) volumes and backups of Amazon RDS are stored in Amazon S3. Alternatively, you can copy files directly into Amazon S3, or you can choose to create backup files and copy them to Amazon S3. There are many backup solutions which store backup data in Amazon S3, and these can be used from Amazon EC2 systems as well.
Data backup options to S3 from on-site infrastructure, or from AWS.
The backup of your data is only half the story. Recovery of data in a disaster scenario needs to be tested and achieved quickly and reliably. Customers should ensure that their systems are configured for appropriate retention and security of data, and that they have tested their data recovery processes.
Restoring a system from S3 backups to Amazon EC2
Key steps for backup and restore:
- Select an appropriate tool or method to back up your data into AWS.
- Ensure that you have an appropriate retention policy for this data.
- Ensure that appropriate security measures are in place for this data, including encryption and access policies.
- Regularly test the recovery of this data and restoration of your system.
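The retention step above can be expressed as an S3 lifecycle configuration. The following is a minimal sketch, not a recommended policy: the `backups/` prefix, the 30-day archive threshold, the 365-day expiration, and the choice of the GLACIER storage class are all assumptions made for illustration. The function only builds the configuration document and does not call AWS.

```python
import json

def build_lifecycle_configuration(prefix: str, archive_after_days: int,
                                  expire_after_days: int) -> dict:
    """Build a lifecycle document in the shape used by the S3
    PutBucketLifecycleConfiguration API."""
    if not 0 < archive_after_days < expire_after_days:
        raise ValueError(
            "archive_after_days must be positive and less than expire_after_days")
    return {
        "Rules": [
            {
                "ID": f"retain-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move backups to colder storage after the archive threshold.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete backups once the retention period has passed.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }

# Hypothetical values: archive after 30 days, delete after one year.
config = build_lifecycle_configuration("backups/", 30, 365)
print(json.dumps(config, indent=2))
```

With boto3, a document like this could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`, along with the bucket name.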
Pilot Light for Quick Recovery into AWS
The idea of the pilot light is an analogy that comes from the gas heater. In a gas heater, a small idle flame that’s always on can quickly ignite the entire furnace to heat up a house as needed. This scenario is similar to a Backup and Restore scenario; however, you must ensure that the most critical core elements of your system are already configured and running in AWS (the pilot light). When the time comes for recovery, you would then rapidly provision a full-scale production environment around the critical core.
Infrastructure elements for the pilot light itself typically include your database servers, which would be replicating data to Amazon EC2. Depending on the system, there may be other critical data outside of the database that needs to be replicated to AWS. This is the critical core of the system (the pilot light) around which all other infrastructure pieces in AWS can quickly be provisioned (the rest of the furnace) to restore the complete system.
To provision the remainder of the infrastructure to restore business-critical services, you would typically have some pre-configured servers bundled as Amazon Machine Images (AMIs), which are ready to be started up at a moment’s notice. When starting recovery, instances from these AMIs come up quickly and find their role within the deployment around the pilot light. From a networking point of view, you can either use Elastic IP addresses (which can be pre-allocated in the preparation phase for DR) and associate them with your instances, or use Elastic Load Balancing to distribute traffic to multiple instances. You would then update your DNS records to point at your Amazon EC2 instance, or use a CNAME to point to your load balancer.
For less critical systems, you can ensure that you have any installation packages and configuration information available in AWS, for example, in the form of an EBS snapshot. This will speed up the application server setup, because you can quickly create multiple volumes in multiple Availability Zones, to attach to EC2 instances. You can then install and configure accordingly.
The Pilot Light method will give you a quicker recovery time than the “Backup and Restore” scenario above, because the core pieces of the system are already running and are continually kept up to date. There are still some installation and configuration tasks to fully recover the applications. AWS enables you to automate the provisioning and configuration of the infrastructure resources, which can save significant time and help protect against human error.
Preparation Phase:
The following figure shows the preparation phase, in which you need to have your regularly changing data replicated to the pilot light, the small core around which the full environment will be started in the recovery phase. Your less frequently updated data, such as operating systems and applications, can be periodically updated and stored as Amazon Machine Images (AMIs).
Key points for preparation:
- Set up EC2 instances to replicate or mirror data.
- Ensure that you have all supporting custom software packages available in AWS.
- Create and maintain Amazon Machine Images (AMIs) of key servers where fast recovery is required.
- Regularly run these servers, test them, and apply any software updates and configuration changes.
- Consider automating the provisioning of AWS resources.
Recovery Phase:
To recover the remainder of the environment around the pilot light, you would start your systems from the Amazon Machine Images (AMIs) in minutes on the appropriate instance types. For your dynamic data servers, you can resize them to handle production volumes as needed or add capacity accordingly. Horizontal scaling, if possible, is often the most cost-effective and scalable way to add capacity to a system; however, it is also possible to pick larger EC2 instance types and thus scale vertically. From a networking perspective, any required DNS updates can be done in parallel.
Once recovered, you should ensure that redundancy is restored as quickly as possible. While a failure of your DR environment shortly after your production environment fails is unlikely, you need to be aware of this risk. Continue to take regular backups of your system and consider additional redundancy at the data layer.
The recovery phase of the “pilot light” scenario.
Key points for recovery:
- Start your application EC2 instances from your custom AMIs.
- Resize and/or scale any database / data store instances, where necessary.
- Change DNS to point at the EC2 servers.
- Install and configure any non-AMI based systems, ideally in an automated fashion.
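The DNS step above can be prepared ahead of time as data. The following minimal sketch assumes that DNS is hosted in Amazon Route 53 and that traffic is cut over to a load balancer through a CNAME; the record name and load balancer DNS name are hypothetical placeholders. The function only builds the change request and does not call AWS.

```python
def build_cname_cutover(record_name: str, load_balancer_dns: str,
                        ttl: int = 60) -> dict:
    """Return a Route 53 ChangeBatch that points record_name at the
    load balancer. A short TTL helps clients pick up the change quickly."""
    return {
        "Comment": "DR cutover to AWS",
        "Changes": [
            {
                # UPSERT creates the record or overwrites the existing one.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": load_balancer_dns}],
                },
            }
        ],
    }

# Hypothetical names used for illustration only.
batch = build_cname_cutover(
    "www.example.com",
    "dr-lb-1234567890.us-east-1.elb.amazonaws.com",
)
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```

With boto3, the result could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the hosted zone ID. Keeping the change as a prebuilt document means the cutover can be reviewed and rehearsed before a disaster occurs.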
Warm Standby Solution in AWS
A warm standby solution extends the pilot light elements and preparation. It further decreases the recovery time because in this case, some services are always running. By identifying your business-critical systems, you would fully duplicate these systems on AWS and have them always on.
These servers can run on a minimum-sized fleet of EC2 instances of the smallest sizes possible. This solution is not scaled to take a full production load, but it is fully functional. It may be used for non-production work, such as testing, quality assurance, and internal use.
In a disaster, the system is scaled up quickly to handle the production load. In AWS, this can be done by adding more instances to the load balancer and by resizing the small capacity servers to run on larger EC2 instance types. As stated above, horizontal scaling, if possible, is often preferred over vertical scaling.
Preparation Phase:
The following diagram shows the preparation phase for a warm standby solution, in which an on-site and an AWS solution run side by side.
The preparation phase of the “warm standby” scenario.
Key points for preparation:
- Set up EC2 instances to replicate or mirror data.
- Create and maintain Amazon Machine Images (AMIs).
- Run your application using a minimal footprint of EC2 instances or AWS infrastructure.
- Patch and update software and configuration files in line with your live environment.
Recovery Phase:
If the production system fails, the standby environment is scaled up for production load, and DNS records are changed to route all traffic to AWS.
The recovery phase of the “warm standby” scenario.
Key points for recovery:
- Start applications on larger EC2 instance types as needed (vertical scaling).
- Increase the size of the EC2 fleets in service with the load balancer (horizontal scaling).
- Change the DNS records so that all traffic is routed to the AWS environment.
- Consider using Auto Scaling to right-size the fleet or accommodate the increased load.
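As a rough illustration of the horizontal scaling step, the following sketch estimates how many instances to add to a warm standby fleet. All numbers are hypothetical, and the 25% headroom is an assumed safety margin rather than an AWS recommendation. Real sizing should be based on load testing.

```python
import math

def instances_needed(peak_load: float, capacity_per_instance: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve peak_load with a safety margin."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    return math.ceil(peak_load * (1 + headroom) / capacity_per_instance)

def instances_to_add(running: int, peak_load: float,
                     capacity_per_instance: float,
                     headroom: float = 0.25) -> int:
    """How many instances to add to the standby fleet (never negative)."""
    target = instances_needed(peak_load, capacity_per_instance, headroom)
    return max(0, target - running)

# Hypothetical numbers: 1,200 requests/s at peak, 150 requests/s per
# instance, and a warm standby fleet of 2 running instances.
print(instances_needed(1200, 150))      # 10
print(instances_to_add(2, 1200, 150))   # 8
```

A target computed this way could be used as the desired capacity of an Auto Scaling group, so that the fleet grows without manual instance launches.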
Multi-Site Solution deployed on AWS and On-Site
A multi-site solution runs in AWS as well as on your existing on-site infrastructure in an active-active configuration. The data replication method that you employ will be determined by the recovery point objective (RPO) you choose. Various replication methods exist.
A weighted DNS service, such as Amazon Route 53, is used to route production traffic to the different sites. A proportion of traffic will go to your infrastructure in AWS, and the remainder will go to your on-site infrastructure.
In an on-site disaster situation, you can adjust the DNS weighting and send all traffic to the AWS servers. The capacity of the AWS service can be rapidly increased to handle the full production load. EC2 Auto Scaling can be used to automate this process. You may need some application logic to detect the failure of the primary database services and cut over to the parallel database services running in AWS.
The cost of this scenario is determined by how much production traffic is handled by AWS in normal operation. In the recovery phase, you pay only for the additional capacity you use, and only for as long as the DR environment is required at full scale. You can further reduce cost by purchasing Reserved Instances for your “always on” AWS servers.
Preparation Phase:
In the figure below, we see the use of DNS to route a portion of the traffic to the AWS site. The application on AWS may access data sources in the on-site production system. Data is replicated or mirrored to AWS infrastructure.
The preparation phase of the “Multi-Site” scenario.
Key points for preparation:
- Set up your AWS environment to duplicate your production environment.
- Set up DNS weighting or similar technology to distribute incoming requests to both sites.
Recovery Phase:
The figure below shows what happens when a disaster occurs on-site. Traffic is cut over to the AWS infrastructure by updating DNS.
The recovery phase of the “multi-site” scenario involving on-site and AWS infrastructure.
Key points for recovery:
- Change the DNS weighting, so that all requests are sent to the AWS site.
- Have application logic for failover to use the local AWS database servers.
- Consider using Auto Scaling to automatically right-size the AWS fleet.
You can further increase the availability of your multi-site solution by designing Multi-AZ architectures.
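The DNS weighting used in this scenario can be illustrated with a short sketch. With weighted routing, each record receives a share of traffic equal to its weight divided by the sum of all weights. The site names and the 80/20 split below are hypothetical, and the sketch only models the arithmetic; it does not call Route 53.

```python
def traffic_shares(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of traffic each site receives: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        # This sketch rejects the all-zero case instead of modeling it.
        raise ValueError("at least one site must have a non-zero weight")
    return {site: weight / total for site, weight in weights.items()}

def failover_weights(weights: dict[str, int], surviving_site: str) -> dict[str, int]:
    """Set every weight except the surviving site's to zero."""
    if surviving_site not in weights:
        raise KeyError(surviving_site)
    return {site: (weight if site == surviving_site else 0)
            for site, weight in weights.items()}

# Hypothetical normal operation: 80% of traffic on-site, 20% to AWS.
normal = {"on-site": 80, "aws": 20}
print(traffic_shares(normal))      # {'on-site': 0.8, 'aws': 0.2}

# On-site disaster: send all requests to the AWS site.
recovery = failover_weights(normal, "aws")
print(traffic_shares(recovery))    # {'on-site': 0.0, 'aws': 1.0}
```

Because the AWS site normally handles only part of the traffic in this example, the cutover multiplies its load several times over, which is why the recovery steps pair the DNS change with scaling of the AWS fleet.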
Conclusion:
Disaster events pose a threat to your workload availability, but by using AWS Cloud services you can mitigate or remove these threats. By first understanding business requirements for your workload, you can choose an appropriate DR strategy. Then, using AWS services, you can design an architecture that achieves the recovery time and recovery point objectives your business needs.