Designing AWS Backup and Retention Policies

This guide covers the essential strategies for data durability and recovery within the AWS ecosystem, specifically focusing on Amazon RDS, EBS, S3, and EFS as part of the SAA-C03 curriculum.

Learning Objectives

Define and differentiate between Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
Configure RDS automated backups and manual snapshots, understanding their performance implications.
Implement EBS snapshot lifecycle policies using Amazon Data Lifecycle Manager (DLM).
Evaluate S3 versioning and cross-region replication (CRR) for disaster recovery.
Design a centralized backup strategy using the AWS Backup service.

Key Terms & Glossary

Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time (e.g., "we can afford to lose 15 minutes of data").
Recovery Time Objective (RTO): The maximum acceptable time to restore service after an outage.
Snapshot: A point-in-time, incremental backup of a resource (like an EBS volume or RDS instance) stored in Amazon S3.
Point-in-Time Recovery (PITR): A feature of RDS that allows restoring a database to any second within the retention period, up to the latest restorable time (typically within the last 5 minutes), using a combination of snapshots and transaction logs.
Retention Period: The duration for which backups are kept before being automatically deleted.

The "Big Idea"

Backup and retention policies are not just about "saving data"; they are about Business Continuity. An architect must balance the Cost of frequent backups and long-term storage against the Risk of data loss and downtime. A high-frequency snapshot policy reduces RPO but increases storage costs and can impact performance during the backup window.

Formula / Concept Box

Metric	Focus	Goal	Calculation Logic
RPO	Data Integrity	Minimize Data Loss	Time between last backup and disaster
RTO	Availability	Minimize Downtime	Time between disaster and service restoration
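
The calculation logic in the table can be sketched in a few lines of Python. This is only an illustration: the timestamps and the 15-minute and 30-minute targets are hypothetical values, not figures from any AWS service.

```python
from datetime import datetime, timedelta

def achieved_rpo(last_backup: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: time between the last good backup and the disaster."""
    return disaster - last_backup

def achieved_rto(disaster: datetime, restored: datetime) -> timedelta:
    """Downtime: time between the disaster and service restoration."""
    return restored - disaster

# Hypothetical incident timeline (all values are assumptions for illustration).
last_backup = datetime(2024, 1, 1, 3, 0)
disaster = datetime(2024, 1, 1, 3, 12)
restored = datetime(2024, 1, 1, 3, 40)

rpo = achieved_rpo(last_backup, disaster)  # 12 minutes of data lost
rto = achieved_rto(disaster, restored)     # 28 minutes of downtime

# Compare against hypothetical business targets.
meets_rpo = rpo <= timedelta(minutes=15)
meets_rto = rto <= timedelta(minutes=30)
print(f"RPO achieved: {rpo}, within target: {meets_rpo}")
print(f"RTO achieved: {rto}, within target: {meets_rto}")
```

The point of the sketch is that RPO is fixed before the disaster (by backup frequency), while RTO is only determined afterwards (by how fast you can restore).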

[!IMPORTANT] For RDS, enabling automated backups (retention > 0) is a prerequisite for Point-in-Time Recovery and creating Read Replicas.

Hierarchical Outline

Fundamental Recovery Metrics
- RPO (Data loss): Determined by backup frequency.
- RTO (Downtime): Determined by restoration speed (instance size, IOPS).
Amazon RDS Backups
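- Automated Snapshots: Daily backup window (30 minutes by default); retention configurable from 0 to 35 days, where 0 disables automated backups (default 7 when the instance is created in the console, 1 via the API or CLI).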
- Transaction Logs: Uploaded every 5 minutes to S3 for PITR.
- Manual Snapshots: User-triggered; persist indefinitely even if the instance is deleted.
Amazon EBS Protection
- Snapshots: Incremental, stored in S3 across multiple AZs.
- Data Lifecycle Manager (DLM): Automates snapshot creation/deletion based on tags.
Amazon S3 Durability
- Versioning: Protects against accidental deletes/overwrites.
- Replication (CRR/SRR): Geographic redundancy for compliance and DR.
Centralized Management
- AWS Backup: Policy-based management for RDS, EBS, EFS, and DynamoDB.
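
As a rough sketch of what "policy-based management" looks like in practice, the Python below builds a backup plan document in the shape accepted by the AWS Backup `create_backup_plan` API. The plan name, vault name, schedule, and retention numbers are hypothetical, and the helper function itself is an illustration rather than part of any AWS SDK. Cold storage support varies by resource type, so treat the lifecycle values as assumptions to verify for your workload.

```python
def build_backup_plan(name: str, vault: str,
                      cold_after_days: int, delete_after_days: int) -> dict:
    """Build a BackupPlan document for AWS Backup (hypothetical helper)."""
    # AWS Backup requires backups moved to cold storage to stay there for
    # at least 90 days, so deletion must come 90+ days after the transition.
    if delete_after_days < cold_after_days + 90:
        raise ValueError("DeleteAfterDays must be >= MoveToColdStorageAfterDays + 90")
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": vault,
                # Daily at 05:00 UTC (assumed schedule).
                "ScheduleExpression": "cron(0 5 ? * * *)",
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": cold_after_days,
                    "DeleteAfterDays": delete_after_days,
                },
            }
        ],
    }

plan = build_backup_plan("central-daily", "Default",
                         cold_after_days=30, delete_after_days=365)

# With boto3 installed and credentials configured, the plan could be submitted as:
#   import boto3
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Resources are then attached to the plan by tag or ARN through a separate backup selection, which is what lets one policy cover RDS, EBS, EFS, and DynamoDB together.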

Visual Anchors

RPO vs. RTO Timeline

This diagram illustrates the relationship between the two key metrics during a failure event.

[Figure: RPO vs. RTO timeline]

RDS Backup Process Flow

[Figure: RDS backup process flow]

Definition-Example Pairs

Snapshot Frequency: The interval at which data captures are taken.
- Example: A financial app takes EBS snapshots every 1 hour to meet a strict RPO, while a dev environment takes them every 24 hours.
Final Snapshot: A manual snapshot taken immediately before a resource is deleted.
- Example: When deleting an RDS instance, AWS prompts for a final snapshot to ensure a clean state is preserved for future needs.
Cross-Region Replication: Automatically copying data to a different AWS Region.
- Example: Replicating an S3 bucket from us-east-1 to us-west-2 to ensure data survives a regional AWS outage.
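
The snapshot-frequency example above (hourly for production, daily for dev) maps onto Amazon Data Lifecycle Manager schedules. The sketch below builds the `PolicyDetails` structure used by the DLM `create_lifecycle_policy` API. The tag key and values, schedule names, and retention counts are hypothetical, and the helper function is an illustration, not an AWS-provided one.

```python
# Hour-based intervals accepted by DLM create rules.
ALLOWED_INTERVALS_HOURS = {1, 2, 3, 4, 6, 8, 12, 24}

def build_dlm_policy_details(tag_key: str, tag_value: str,
                             interval_hours: int, retain_count: int) -> dict:
    """Build DLM PolicyDetails for tagged EBS volumes (hypothetical helper)."""
    if interval_hours not in ALLOWED_INTERVALS_HOURS:
        raise ValueError(f"interval must be one of {sorted(ALLOWED_INTERVALS_HOURS)}")
    return {
        "ResourceTypes": ["VOLUME"],
        # Only volumes carrying this tag are snapshotted.
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [
            {
                "Name": f"every-{interval_hours}h",
                "CreateRule": {"Interval": interval_hours, "IntervalUnit": "HOURS"},
                # Keep the newest N snapshots; older ones are deleted.
                "RetainRule": {"Count": retain_count},
            }
        ],
    }

# Hourly snapshots for the financial app, daily for dev (assumed tag values).
prod_policy = build_dlm_policy_details("Environment", "prod", 1, 24)
dev_policy = build_dlm_policy_details("Environment", "dev", 24, 7)

# With boto3, a policy could be created roughly as follows
# (the role ARN is a placeholder you must supply):
#   boto3.client("dlm").create_lifecycle_policy(
#       ExecutionRoleArn=role_arn, Description="hourly prod snapshots",
#       State="ENABLED", PolicyDetails=prod_policy)
```

The worst-case RPO from snapshots alone equals the schedule interval, so the production policy bounds data loss at about one hour and the dev policy at about one day.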

Worked Examples

Scenario: Configuring RDS for High Availability and Low RPO

Problem: A production PostgreSQL database requires a 5-minute RPO and a 30-minute RTO.

Solution:

RPO Implementation: Enable Automated Backups with a retention period of 7 days. This automatically enables transaction log archiving to S3 every 5 minutes, meeting the RPO.
RTO Implementation: Use a Multi-AZ Deployment. A snapshot restore to a new instance can take minutes to hours, whereas Multi-AZ provides a standby instance that typically fails over in 60-120 seconds, drastically reducing RTO compared to a cold restore.
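Performance Tip: Schedule the 30-minute backup window during the least busy time (e.g., 03:00 UTC). On a Single-AZ instance, storage I/O can be briefly suspended while the backup initializes; with Multi-AZ, the backup is taken from the standby, so the primary is largely unaffected.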
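
The settings from this scenario can be sketched as the keyword arguments for the RDS `modify_db_instance` API. The instance identifier `prod-postgres` is hypothetical, and the helper function is an illustration under the scenario's assumptions (7-day retention, 30-minute window starting at 03:00 UTC, Multi-AZ enabled).

```python
from datetime import datetime, timedelta

def build_rds_backup_settings(db_id: str, retention_days: int,
                              window_start_utc: str, multi_az: bool = True) -> dict:
    """Build modify_db_instance arguments for backups (hypothetical helper)."""
    # RDS accepts 0-35 days; 0 disables automated backups and therefore PITR.
    if not 0 <= retention_days <= 35:
        raise ValueError("retention_days must be between 0 and 35")
    start = datetime.strptime(window_start_utc, "%H:%M")
    end = start + timedelta(minutes=30)  # 30-minute backup window
    return {
        "DBInstanceIdentifier": db_id,
        "BackupRetentionPeriod": retention_days,
        # Format is hh24:mi-hh24:mi, always in UTC.
        "PreferredBackupWindow": f"{start:%H:%M}-{end:%H:%M}",
        "MultiAZ": multi_az,
    }

settings = build_rds_backup_settings("prod-postgres", 7, "03:00")
print(settings["PreferredBackupWindow"])

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("rds").modify_db_instance(ApplyImmediately=True, **settings)
```

Retention above zero is what turns on transaction log archiving, so the 5-minute RPO comes from `BackupRetentionPeriod`, while the RTO improvement comes from `MultiAZ`.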

Checkpoint Questions

What is the default retention period for RDS automated backups?
True or False: If you delete an RDS instance, your automated snapshots are kept by default.
Which service should you use to automate EBS snapshots based on resource tags?
How does S3 Versioning protect against accidental deletion differently than a snapshot?
If you restore a database from an RDS snapshot, does it overwrite the existing instance or create a new one?

Answers

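7 days when the instance is created in the console (1 day via the API or CLI); the range is 0-35, where 0 disables automated backups.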
False (They are deleted unless specifically retained; manual snapshots are kept).
Amazon Data Lifecycle Manager (DLM).
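Versioning works per object and continuously: it keeps every version of an object, and a delete request simply adds a 'delete marker' rather than removing the bits. A snapshot captures a whole resource at a single point in time, so anything written after the last snapshot is not protected.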
It creates a completely new DB instance with a new endpoint.
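
To make the delete-marker behaviour from the versioning answer concrete, here is a toy in-memory model in Python. It is not the S3 API: the class, method names, and version IDs are invented for illustration and only mimic how a versioned bucket stacks versions and markers.

```python
DELETE_MARKER = object()  # sentinel standing in for an S3 delete marker

class VersionedBucket:
    """Toy model of a versioning-enabled bucket (not the real S3 API)."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body), oldest first
        self._counter = 0

    def _append(self, key, body):
        self._counter += 1
        version_id = f"v{self._counter}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def put(self, key, body):
        # An overwrite adds a new version; older versions are kept.
        return self._append(key, body)

    def delete(self, key):
        # A plain delete only stacks a marker on top of the existing versions.
        return self._append(key, DELETE_MARKER)

    def get(self, key):
        versions = self._versions.get(key, [])
        if not versions or versions[-1][1] is DELETE_MARKER:
            return None  # object appears deleted
        return versions[-1][1]

    def remove_version(self, key, version_id):
        # Removing a specific version (or marker) is permanent.
        self._versions[key] = [v for v in self._versions[key] if v[0] != version_id]

bucket = VersionedBucket()
bucket.put("report.csv", "draft")
bucket.put("report.csv", "final")
marker_id = bucket.delete("report.csv")
after_delete = bucket.get("report.csv")        # object looks gone
bucket.remove_version("report.csv", marker_id)
after_undelete = bucket.get("report.csv")      # latest real version is back
```

Removing the marker "undeletes" the object, which is why versioning protects against accidental deletes without needing a separate restore step.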