AWS Data Replication Methods & Disaster Recovery Strategy

This guide explores the mechanisms for data replication within the AWS ecosystem, focusing on achieving business continuity through the optimization of Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).

Learning Objectives

Define and differentiate between RPO and RTO in the context of business continuity.
Evaluate various AWS data replication methods including synchronous and asynchronous types.
Compare replication features across Amazon RDS, Amazon Aurora, and Amazon DynamoDB.
Identify the appropriate AWS Database Migration Service (DMS) task type for specific migration scenarios.
Describe block-level replication using AWS Elastic Disaster Recovery.

Key Terms & Glossary

Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time (e.g., "We can afford to lose 15 minutes of data").
Recovery Time Objective (RTO): The maximum acceptable duration of downtime before a service must be restored.
Change Data Capture (CDC): A process that observes and captures changes made to a source database so they can be applied to a target.
Active-Active Replication: A topology where multiple replicas can handle both read and write traffic simultaneously (e.g., DynamoDB Global Tables).
Pilot Light: A DR strategy where a minimal version of the environment is always running (usually just the data layer) to reduce costs while allowing for relatively fast recovery.

The "Big Idea"

Data replication is not just about copying bits; it is about balancing cost, performance, and risk. In a distributed system, replication ensures that a failure in one Availability Zone (AZ) or Region does not result in total data loss or permanent service outages. The choice of replication method (synchronous vs. asynchronous) dictates where a system sits on the spectrum between "Zero Data Loss" and "High Performance."

Formula / Concept Box

Metric	Focus	Question to Ask
RPO	Data Integrity	"How much data can we afford to lose?"
RTO	Availability	"How quickly must we be back online?"

[!IMPORTANT] The Relationship: $\text{Total Impact} = \text{Data Lost (RPO)} + \text{Productivity Lost (RTO)}$
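The two metrics can be made concrete by measuring them after an incident. The following is a minimal sketch, not a prescribed method: the function names and the sample timestamps are hypothetical, and it assumes you know when the last successfully replicated write occurred, when the failure happened, and when service was restored.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_at: datetime, failure_at: datetime) -> timedelta:
    """Data-loss window: time between the last replicated write and the failure."""
    return max(failure_at - last_replicated_at, timedelta(0))


def achieved_rto(failure_at: datetime, restored_at: datetime) -> timedelta:
    """Downtime window: time between the failure and the restoration of service."""
    return restored_at - failure_at


def meets_objectives(rpo_actual: timedelta, rto_actual: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss and downtime windows are within target."""
    return rpo_actual <= rpo_target and rto_actual <= rto_target


# Hypothetical incident timeline (illustrative values only).
last_replicated = datetime(2024, 1, 1, 11, 52)
failure = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 45)

rpo = achieved_rpo(last_replicated, failure)   # 8 minutes of writes lost
rto = achieved_rto(failure, restored)          # 45 minutes of downtime
ok = meets_objectives(rpo, rto,
                      rpo_target=timedelta(minutes=15),
                      rto_target=timedelta(hours=1))
```

With a 15-minute RPO target and a 1-hour RTO target, this hypothetical incident stays within both objectives; tightening either target below the measured window would make the check fail.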

Hierarchical Outline

Core Metrics
- RPO (Recovery Point Objective): Quantifies data loss.
- RTO (Recovery Time Objective): Quantifies downtime.
Database Replication Methods
- Amazon RDS: Multi-AZ (Sync) for HA; Read Replicas (Async) for scaling/DR.
- Amazon Aurora: Global Databases using storage-based fast replication.
- Amazon DynamoDB: Global Tables for multi-region active-active workloads.
Migration & Continuous Sync
- AWS DMS: Supports Full Load, CDC, or both.
Block-Level Replication
- AWS Elastic Disaster Recovery (AWS DRS): Successor to CloudEndure Disaster Recovery; replicates servers at the block level.

Visual Anchors

AWS Database Migration Service (DMS) Workflow

[Diagram: DMS workflow]

Visualizing RPO and RTO

[Diagram: RPO and RTO]

Definition-Example Pairs

Synchronous Replication: A write is acknowledged only after it has been committed on both the primary and the replica.
- Example: RDS Multi-AZ deployment where a write to the primary instance must be committed to the standby in a different AZ before completion.
Asynchronous Replication: Data is written to the primary first, and then sent to the replica with a slight delay.
- Example: RDS Read Replicas across regions; the primary handles the transaction, and the replica is updated shortly after, resulting in a small "Replica Lag."
Block-Level Replication: Moving data at the storage layer (disk blocks) rather than the file or database layer.
- Example: AWS Elastic Disaster Recovery continuously mirroring on-premises VM disks to low-cost staging resources in AWS.
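The difference between the synchronous and asynchronous definitions above shows up most clearly when the primary fails. The toy model below is only an illustration under simplifying assumptions (in-memory lists, a hypothetical `ReplicatedStore` class, no networking); it does not represent how RDS or any AWS service is implemented.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary/replica pair that replicates either synchronously or asynchronously."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = deque()  # acknowledged writes not yet applied to the replica

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # The write is acknowledged only after the replica also has it.
            self.replica.append(record)
        else:
            # The write is acknowledged immediately; the replica catches up later.
            self.pending.append(record)

    def replicate(self, max_records: int) -> None:
        """Drain up to max_records pending writes to the replica (models replica lag)."""
        for _ in range(min(max_records, len(self.pending))):
            self.replica.append(self.pending.popleft())

    def fail_primary(self) -> list:
        """Simulate losing the primary; return acknowledged writes the replica never got."""
        return self.primary[len(self.replica):]


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)
for record in ["w1", "w2", "w3", "w4", "w5"]:
    sync_store.write(record)
    async_store.write(record)

async_store.replicate(max_records=3)      # the replica is two writes behind

sync_lost = sync_store.fail_primary()     # []
async_lost = async_store.fail_primary()   # ["w4", "w5"]
```

The synchronous store loses nothing because every acknowledged write already exists on the replica, at the cost of waiting for the replica on each write. The asynchronous store acknowledges faster but loses whatever was still in flight, which is exactly the lag window that determines its RPO.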

Comparison Tables

AWS Database Replication Comparison

Feature	RDS Read Replicas	Aurora Global Database	DynamoDB Global Tables
Mechanism	Log-based (Async)	Storage-based (Async)	Version-based (Async)
Failover Time	Minutes	< 1 Minute	Near-instant
Typical RPO	Seconds/Minutes	< 1 Second	< 1 Second
Topology	Primary-Replica	Primary-Secondary	Active-Active
Best Use Case	Read scaling / DR	High-perf global apps	Globally distributed NoSQL

Worked Examples

Problem: Choosing a DMS Strategy

Scenario: A company needs to migrate a 2 TB production SQL database to AWS with less than 30 minutes of downtime.

Solution:

Select Task Type: Use "Migrate existing data and replicate ongoing changes" (Full Load + CDC).
Execution:
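- The Full Load phase copies the 2 TB while the source remains active (the extra read load may slightly affect source performance).
- Changes committed during the copy are captured and applied after the full load completes; the CDC phase then keeps the target in sync until cutover.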
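Cutover: Once CDC latency (the lag between source and target) is near zero, stop application writes to the source, wait for the remaining changes to apply, point the application to the new AWS endpoint, and stop the DMS task. Downtime is then limited to the final drain and the connection-string update.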
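The task selection and cutover check in this worked example could be sketched as follows. This is a hedged illustration: the helper names, the ARNs, the task identifier, and the 5-second threshold are all hypothetical. The dictionary keys follow the parameters of the boto3 DMS `create_replication_task` call, and the lag checked before cutover is assumed to come from the DMS CloudWatch metrics `CDCLatencySource` and `CDCLatencyTarget` (both in seconds). The code only builds the request; it does not call AWS.

```python
import json


def build_dms_task_request(task_id: str, source_endpoint_arn: str,
                           target_endpoint_arn: str,
                           replication_instance_arn: str) -> dict:
    """Build keyword arguments for a Full Load + CDC replication task."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            # "%" is the DMS wildcard: include every table in every schema.
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_endpoint_arn,
        "TargetEndpointArn": target_endpoint_arn,
        "ReplicationInstanceArn": replication_instance_arn,
        # API value for "Migrate existing data and replicate ongoing changes".
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }


def ready_for_cutover(cdc_latency_source_s: float, cdc_latency_target_s: float,
                      threshold_s: float = 5.0) -> bool:
    """Cutover is reasonable once both latency readings are at or below the threshold."""
    return max(cdc_latency_source_s, cdc_latency_target_s) <= threshold_s


# Placeholder ARNs for illustration only; substitute real ones.
request = build_dms_task_request(
    task_id="prod-sql-migration",
    source_endpoint_arn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    target_endpoint_arn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    replication_instance_arn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
)
```

Assuming boto3 is installed and credentials are configured, the request could be submitted with `boto3.client("dms").create_replication_task(**request)`. Keeping the request construction separate from the API call makes the configuration easy to review before anything is created.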

Checkpoint Questions

What is the main difference between RPO and RTO?
Which AWS service provides sub-minute RTO for multi-region database failover?
When using AWS DMS, which option should you choose if you only want to keep a target database in sync with a source that has already been migrated?
Why does AWS Global Accelerator provide faster failover than Route 53 DNS for some disaster recovery scenarios?

Muddy Points & Cross-Refs

Aurora Multi-AZ vs. Aurora Global: Learners often confuse these. Aurora Multi-AZ happens within a single region (using shared storage), whereas Aurora Global handles cross-region replication (using dedicated replication infrastructure).
Pilot Light vs. Warm Standby: In a Pilot Light, the application servers are NOT running (they are started from AMIs/IaC during DR). In a Warm Standby, a minimal version of the app servers is always running but scaled down.
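DMS Target Table Options: Remember that "Truncate" empties the tables but keeps the metadata (table structure), while "Drop" removes the tables entirely and recreates them. DMS creates only the objects it needs to migrate the data (tables, primary keys, and in some cases unique indexes), so use Truncate if you have manually defined secondary indexes or triggers on the target that you want to keep.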
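In the task settings JSON, the target table option is expressed through `FullLoadSettings.TargetTablePrepMode`. The mapping below is a small sketch: the helper name and the lowercase option labels are hypothetical conveniences, while the three mode values are the ones DMS task settings accept.

```python
import json

# Console-style option label (lowercased) -> DMS task setting value.
PREP_MODES = {
    "do nothing": "DO_NOTHING",
    "drop tables on target": "DROP_AND_CREATE",
    "truncate": "TRUNCATE_BEFORE_LOAD",
}


def full_load_settings(option: str) -> str:
    """Return a task-settings JSON string that selects the given prep mode."""
    mode = PREP_MODES[option.strip().lower()]  # raises KeyError for unknown options
    return json.dumps({"FullLoadSettings": {"TargetTablePrepMode": mode}})


# Keep the existing table structure on the target and only empty the tables.
settings_json = full_load_settings("Truncate")
```

The resulting string is the kind of value supplied as the replication task's settings; an unknown option fails fast with a `KeyError` rather than silently falling back to a destructive mode.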