21 Apr 2011

Designing for failure with Amazon Web Services 

By - Amazon 1 Comment

Avoid single points of failure. You can and should assume everything will fail.

Start by listing the major components of your architecture, then break each one down a level or two further. Review every item on the list and consider what would happen if it failed.

At a minimum, you need redundancy or a fallback plan for each of these areas:

CloudFront

If you depend on CloudFront, have an alternative CDN ready, such as MaxCDN, EdgeCast, or Akamai.

Elastic Compute Cloud (EC2)

Run your EC2 instances redundantly across multiple AZs and multiple regions. Be prepared to use an alternate cloud provider in the worst-case scenario of multiple simultaneous region failures. This means having offsite backups of all data required to resume your business.

Elastic Load Balancing (ELB)

Be prepared for ELB failures. Consider having a backup plan in place that uses HAProxy as the load balancer. Keeping DNS TTLs (time-to-live) low lets clients pick up the replacement address quickly when you need to fail over.
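As a rough illustration of why the TTL matters, the time clients may keep hitting a dead load balancer is about the time needed to detect the failure plus the time for cached DNS answers to expire. The sketch below is plain arithmetic; the function name and the example numbers are assumptions, and some resolvers do not honor TTLs exactly.

```python
def worst_case_failover_seconds(check_interval, failed_checks_needed, dns_ttl):
    """Rough upper bound on how long clients may still reach a failed endpoint."""
    # Time for the health check to declare the endpoint dead.
    detection = check_interval * failed_checks_needed
    # After DNS is repointed, cached answers can live for up to one TTL.
    return detection + dns_ttl


# Hypothetical numbers: 30 s checks, 3 consecutive failures required.
low_ttl = worst_case_failover_seconds(30, 3, 60)
high_ttl = worst_case_failover_seconds(30, 3, 3600)
print(low_ttl, high_ttl)
```

With a one-hour TTL the DNS cache dominates the outage window, which is the case for keeping TTLs short on any record you may need to repoint.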

As a best practice, configuring your Elastic Load Balancers in more than one zone is recommended. However, note that the round-robin is based on the number of availability zones, not the total number of server instances in the zones being serviced. Traffic will be routed evenly across all availability zones. Therefore, you should consider having about the same number of frontend servers in each zone to evenly distribute the load.

Important! An ELB will round-robin traffic between all selected availability zones regardless of whether or not a serviceable instance exists in that zone. When creating an ELB, you must only select the availability zones in which instances are currently running and attached, otherwise a portion of your requests will return a 503 error.
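To make the per-zone behavior concrete, here is a minimal sketch that models it. It assumes exactly what is described above (traffic split evenly across the selected zones, then evenly across the instances inside each zone); the function name and zone labels are illustrative only.

```python
def per_instance_share(instances_per_zone):
    """Return (traffic share per instance for each zone, share of requests lost).

    Models an even split across zones, then an even split across the
    instances inside each zone.
    """
    zone_share = 1.0 / len(instances_per_zone)
    shares = {}
    lost = 0.0
    for zone, count in instances_per_zone.items():
        if count == 0:
            # Nothing can serve this zone's share of the traffic.
            lost += zone_share
        else:
            shares[zone] = zone_share / count
    return shares, lost


# Unbalanced zones: the lone instance in zone b carries half of all traffic.
unbalanced, _ = per_instance_share({"zone-a": 4, "zone-b": 1})

# A selected zone with no instances: half of all requests have nowhere to go.
_, lost = per_instance_share({"zone-a": 2, "zone-b": 0})
print(unbalanced, lost)
```

In the unbalanced case each zone-a instance handles 12.5% of traffic while the single zone-b instance handles 50%, which is why similar instance counts per zone matter.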

Amazon Machine Image (AMI)

Ensure you use your own AMIs and are not relying on a third party to provide AMIs. If your third-party AMI is removed from Amazon, you will not be able to start new instances with your existing configuration.

Auto Scale Groups (ASG)

Ensure your ASG is sized properly. Don’t assume that your group will always scale immediately; instances aren’t always available to be created on demand. It’s good practice to scale preemptively, ahead of anticipated load, so that resources are available before they are required.
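One way to think about scaling ahead of demand is to size the group for expected load plus a safety margin, rounded up so each zone gets the same number of instances. This is a sketch under assumed inputs: the function name, the 25% headroom default, and the throughput figures are hypothetical, not values from AWS.

```python
import math


def desired_capacity(expected_rps, rps_per_instance, headroom=0.25, zones=2):
    """Instances to run ahead of expected load, as a multiple of the zone count."""
    # Size for the expected load plus a safety margin.
    needed = math.ceil(expected_rps * (1 + headroom) / rps_per_instance)
    # Round up so every zone gets the same number of instances.
    per_zone = math.ceil(needed / zones)
    return per_zone * zones


# Hypothetical: 1000 requests/s expected, each instance handles about 150.
print(desired_capacity(1000, 150))
```

The rounding step keeps zones balanced, which also matters for the per-zone load balancing behavior described in the ELB section.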

Elastic IP (EIP)

Use Elastic IP addresses for any instance that needs a static address. Without an EIP, the instance's IP address will change when you stop and start it. An EIP is detached when you stop an EBS-backed instance, so be sure to reattach it after you start the instance again.

Relational Database Service (RDS)

Use Multi-AZ deployments to protect against a single zone failure. Ensure a copy of your data is exported offsite automatically at regular intervals.

Elastic Block Store (EBS)

Perform snapshots for all EBS volumes. This should be fully automated for every EBS volume with data that isn’t ephemeral.
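An automated snapshot job needs a way to decide which volumes are overdue. The sketch below shows only that decision step; the inventory dictionary, the volume IDs, and the 24-hour threshold are assumptions for illustration, and a real job would fetch snapshot times from the EC2 API and then create the snapshots.

```python
from datetime import datetime, timedelta


def volumes_needing_snapshot(last_snapshot_at, now, max_age=timedelta(hours=24)):
    """Return sorted IDs of volumes with no snapshot, or none newer than max_age."""
    due = []
    for volume_id, taken_at in last_snapshot_at.items():
        if taken_at is None or now - taken_at > max_age:
            due.append(volume_id)
    return sorted(due)


now = datetime(2011, 4, 21, 12, 0)
inventory = {
    "vol-aaaa": datetime(2011, 4, 21, 3, 0),  # 9 hours old: fine
    "vol-bbbb": datetime(2011, 4, 19, 3, 0),  # more than 2 days old: due
    "vol-cccc": None,                         # never snapshotted: due
}
due = volumes_needing_snapshot(inventory, now)
print(due)
```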

Other areas that may fail and should be reviewed:

- Elastic MapReduce (EMR)
- Flexible Payments Service (FPS)
- Route 53 Service (R53)
- Simple Email Service (SES)
- Simple Notification Service (SNS)
- Simple Queue Service (SQS)
- Simple Storage Service (S3)
- SimpleDB (SDB)
- AWS CloudFormation
- AWS Elastic Beanstalk
- AWS Import/Export Service
- AWS Management Console

General better practices

- Separate Amazon Accounts for DEV, TEST, STAGE, PRODUCTION
- Use Multiple Availability Zones (AZs) whenever possible
- Use Multiple Regions
- Backup data to different region(s)
- Be prepared to lose an entire region
- Bootstrap your instances – Instances should boot and ask “Who am I & what is my role?”
- Build security at every layer
- Create distinct security groups for each Amazon EC2 cluster
- Use group based rules for control between layers
- Restrict external access
- Encrypt data at rest
- Encrypt data in-transit
- Consider encrypted filesystems
- Rotate your AWS credentials
- Use multi-factor authentication; Duosecurity.com and Yubico.com are great examples.
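The bootstrapping item in the list above (“Who am I & what is my role?”) can be sketched as an instance reading a role from its user data at boot. This is one possible approach, not a prescribed one: the key=value format, the `role` key, and the function names are assumptions. On a real instance the text would typically come from the instance metadata service at http://169.254.169.254/latest/user-data.

```python
def parse_user_data(user_data):
    """Parse simple key=value lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in user_data.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def role_for_instance(user_data, default="unassigned"):
    """Answer the bootstrap question: what role should this instance take?"""
    return parse_user_data(user_data).get("role", default)


# Hypothetical user data passed at launch.
sample = "# launch settings\nrole = web\nenv = production\n"
print(role_for_instance(sample))
```

A boot script could then use the returned role to pick which packages and configuration to install, so that no instance depends on hand-applied setup.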

Disable Delete on Termination

Though EBS volumes created and attached to an instance at instantiation are preserved through a “stop”/“start” cycle, by default they are destroyed and lost when an EC2 instance is terminated. This behavior can be changed with a delete-on-termination boolean value buried in the documentation for the --block-device-mapping option of ec2-run-instances.

Disable Instance Termination

$ ec2-run-instances --disable-api-termination ami-xxxxxxxx

Set Shutdown Behavior

With the legacy S3-based AMIs, both a shutdown from within the instance and a terminate request end the instance for good, and you lose all local and ephemeral storage (boot disk and /mnt) forever. Hope you remembered to save the important stuff elsewhere!

A shutdown from within an EBS boot instance, by default, will initiate a “stop” instead of a “terminate”. This means that your instance is not in a running state and not getting charged, but the EBS volumes still exist and you can “start” the same instance id again later, losing nothing.
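The three protections above (keeping the volume on termination, blocking API termination, and stopping rather than terminating on shutdown) can be requested together at launch. The sketch below builds the parameters as a plain dictionary. It assumes the boto3 library, which postdates the command-line tools used in this article; the function name, the AMI placeholder, the instance type, and the `/dev/sda1` root device name are illustrative and vary by AMI.

```python
def protected_launch_params(ami_id, instance_type, root_device="/dev/sda1"):
    """Build launch parameters that guard against accidental data loss."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Keep the root EBS volume when the instance is terminated.
        "BlockDeviceMappings": [
            {"DeviceName": root_device, "Ebs": {"DeleteOnTermination": False}}
        ],
        # Reject terminate requests made through the API.
        "DisableApiTermination": True,
        # An OS-level shutdown stops the instance instead of terminating it.
        "InstanceInitiatedShutdownBehavior": "stop",
    }


params = protected_launch_params("ami-xxxxxxxx", "m1.small")
# With boto3 installed and credentials configured, these could be passed as:
#   boto3.client("ec2").run_instances(**params)
print(params["DisableApiTermination"])
```

Building the parameters separately from the API call keeps the protection settings in one place where they are easy to review and test.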

Have an offsite copy of your data

References
http://media.amazonwebservices.com/AWS_Cloud_Best_Practices.pdf

http://www.slideshare.net/adrianco/netflix-in-the-cloud-2011

One Response to “Designing for failure with Amazon Web Services”

  1. Ute Larsen says:

    After a 72 hour outage, it would be hard to convince my CTO.
