AWS Certified Data Engineer: Protecting Data with Resiliency and Availability
Protect data with appropriate resiliency and availability
This guide focuses on the critical task of ensuring data durability, availability, and cost-efficient protection within the AWS ecosystem, covering both high availability (HA) and disaster recovery (DR) strategies.
Learning Objectives
By the end of this study guide, you will be able to:
- Define and calculate RPO (Recovery Point Objective) and RTO (Recovery Time Objective).
- Differentiate between High Availability and Disaster Recovery architectures.
- Implement S3 Lifecycle policies for automatic tiering and data deletion.
- Configure Cross-Region Replication (CRR) and Multi-AZ deployments for resiliency.
- Utilize Amazon Macie, AWS Config, and S3 Versioning for data governance.
Key Terms & Glossary
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., losing 1 hour of data).
- RTO (Recovery Time Objective): The target time to restore a service after a disruption (e.g., service back online in 4 hours).
- Resiliency: The ability of a workload to respond to and recover from failures.
- Availability: The percentage of time that a workload is operational (e.g., "four nines" or 99.99%).
- CRR (Cross-Region Replication): Automatically copying S3 objects across different AWS Regions for geographic redundancy.
- WORM (Write Once, Read Many): A data storage model ensured by S3 Object Lock to prevent modification or deletion.
- TTL (Time to Live): A mechanism in DynamoDB to automatically expire items from a table to reduce storage costs.
The "Big Idea"
Data protection is not a "one size fits all" task. It is a strategic balance between Business Continuity and Cost Optimization. As a Data Engineer, you must classify data into tiers (Hot vs. Cold) and apply the appropriate level of resiliency (Multi-AZ vs. Multi-Region) based on the business's tolerance for downtime (RTO) and data loss (RPO). Priority Zero is Security, but Priority One is ensuring the data exists and is accessible when the business needs it.
Formula / Concept Box
Disaster Recovery Metrics
| Metric | Definition | Focus |
|---|---|---|
| RPO | Maximum acceptable data loss, measured in time | Data integrity / loss prevention |
| RTO | Target time to restore a service after a disruption | Service uptime / availability |
[!TIP] Pro Tip: Smaller RPO/RTO targets mean less data loss and faster recovery, but they drive significantly higher infrastructure costs.
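The relationship between an availability target and allowed downtime is simple arithmetic. A minimal sketch (the function name is ours, not an AWS API):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def max_annual_downtime_minutes(availability_pct: float) -> float:
    """Maximum downtime per year (in minutes) permitted by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100) / 60

# "Four nines" (99.99%) allows roughly 52.6 minutes of downtime per year;
# "three nines" (99.9%) allows roughly 525.6 minutes (about 8.8 hours).
print(round(max_annual_downtime_minutes(99.99), 1))
print(round(max_annual_downtime_minutes(99.9), 1))
```

This is why each additional "nine" costs so much: the allowed failure window shrinks by a factor of ten.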
Hierarchical Outline
- Foundational Resilience Concepts
- Recovery Metrics: Establishing RPO and RTO with stakeholders.
- High Availability (HA): Using Multi-AZ for RDS, EBS, and MSK.
- Disaster Recovery (DR): Regional failure protection using CRR and snapshots.
- Storage Resiliency Strategies
- Amazon S3: Versioning (accidental deletion), Object Lock (compliance), and Replication.
- Amazon EBS/RDS: Multi-AZ deployments for automated failover.
- Amazon Redshift: Cross-Region snapshots for regional recovery.
- Data Lifecycle Management (DLM)
- Automation: Using S3 Lifecycle Policies to transition data (Standard -> IA -> Glacier).
- Archiving: Long-term storage in S3 Glacier for legal/compliance needs.
- Cleanup: DynamoDB TTL and automated S3 expiration to minimize costs.
- Security & Governance
- Discovery: Amazon Macie for identifying PII in S3.
- Enforcement: AWS Config rules to ensure deletion and encryption policies are met.
- Protection: AWS Shield for DDoS and AWS Backup for centralized management.
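As a concrete example of the enforcement bullet above, AWS Config can continuously check that every S3 bucket has versioning enabled via the managed rule `S3_BUCKET_VERSIONING_ENABLED`. Below is a sketch of the rule payload (the rule name is a hypothetical choice):

```python
# Payload for an AWS Config managed rule that flags S3 buckets
# without versioning enabled.
versioning_rule = {
    "ConfigRuleName": "s3-versioning-enabled",  # hypothetical name
    "Source": {
        "Owner": "AWS",  # managed (AWS-authored) rule
        "SourceIdentifier": "S3_BUCKET_VERSIONING_ENABLED",
    },
    "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
}

# With boto3 this would be applied via:
#   boto3.client("config").put_config_rule(ConfigRule=versioning_rule)
print(versioning_rule["Source"]["SourceIdentifier"])
```

Non-compliant buckets then surface in the Config dashboard without any custom Lambda code.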
Visual Anchors
- S3 Lifecycle Flow
- Resiliency Architecture Patterns
Definition-Example Pairs
- S3 Versioning: Storing multiple iterations of the same object.
- Example: If a script accidentally deletes sales_data.csv, versioning allows you to restore the previous version instantly.
- S3 Object Lock: Prevents an object from being deleted or overwritten for a fixed amount of time.
- Example: A financial firm must keep records for 7 years per SEC rules; Object Lock ensures no admin can delete them prematurely.
- DynamoDB TTL: Automatically deleting items based on a timestamp attribute.
- Example: Deleting temporary session data for a web app after 24 hours of inactivity to keep the table size (and cost) lean.
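DynamoDB TTL works by comparing an item attribute holding an epoch-seconds timestamp against the current time. A hedged sketch of the session example above (table name, attribute name, and 24-hour policy are illustrative):

```python
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 3600  # hypothetical policy: expire sessions after 24h

def session_item(session_id: str, now: Optional[int] = None) -> dict:
    """Build a DynamoDB item whose 'expires_at' attribute drives TTL deletion."""
    now = int(time.time()) if now is None else now
    return {
        "session_id": {"S": session_id},
        "expires_at": {"N": str(now + SESSION_TTL_SECONDS)},  # epoch seconds
    }

# TTL itself is enabled once per table, e.g. with boto3:
#   boto3.client("dynamodb").update_time_to_live(
#       TableName="sessions",
#       TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"})
```

Note that TTL deletion is best-effort and can lag the timestamp by some time, so queries should still filter out expired items.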
Worked Examples
Scenario 1: Determining the DR Strategy
Problem: A healthcare company requires an architecture where data is replicated in real time to a second region. They can tolerate almost zero data loss (RPO < 1 min) and need to be back online within 10 minutes of a regional failure.
Solution:
- Architecture: Active-Active or Warm Standby (Active-Passive).
- S3: Enable Cross-Region Replication (CRR).
- RDS: Use Cross-Region Read Replicas or Aurora Global Database.
- Result: Low RPO via continuous replication; Low RTO via pre-provisioned resources in the second region.
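The S3 half of this solution can be sketched as a replication configuration with S3 Replication Time Control (RTC), which adds a 15-minute replication SLA to support the low-RPO requirement. Bucket names and the IAM role ARN below are hypothetical:

```python
# Cross-Region Replication configuration for put_bucket_replication,
# with Replication Time Control (RTC) enabled.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/crr-replication-role",  # hypothetical
    "Rules": [
        {
            "ID": "replicate-all-to-dr-region",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter -> replicate the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::dr-region-bucket",  # hypothetical
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
            },
        }
    ],
}

# Applied with: boto3.client("s3").put_bucket_replication(
#     Bucket="primary-bucket", ReplicationConfiguration=replication_config)
```

Versioning must be enabled on both source and destination buckets before replication can be configured.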
Scenario 2: S3 Lifecycle Policy Configuration
Problem: You have logs that are accessed daily for 30 days, then rarely accessed for the next year, and must be kept for 5 years for compliance.
Solution:
- Transition 1: After 30 days, move from S3 Standard to S3 Standard-IA.
- Transition 2: After 365 days, move to S3 Glacier Deep Archive.
- Expiration: After 1,825 days (5 years), delete the object.
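The three steps above map directly onto a single S3 lifecycle rule. A sketch of the configuration (the `logs/` prefix and rule ID are hypothetical):

```python
# Lifecycle rule implementing Scenario 2:
# 30 days -> Standard-IA, 365 days -> Deep Archive, delete at 5 years.
lifecycle_config = {
    "Rules": [
        {
            "ID": "log-retention",  # hypothetical name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 1825},  # 5 years
        }
    ],
}

# Applied with: boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="log-bucket", LifecycleConfiguration=lifecycle_config)
```

Each transition's `Days` value counts from object creation, not from the previous transition, which is why the second transition uses 365 rather than 335.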
Checkpoint Questions
- What is the main difference between S3 Cross-Region Replication and S3 Versioning regarding data protection?
- Which service would you use to discover sensitive PII data before moving it to an archive?
- True or False: Serverless analytics solutions like AWS Glue have built-in high availability.
- If a company can tolerate losing 24 hours of data, what is their RPO?
Comparison Tables
Disaster Recovery Architecture Comparison
| Strategy | Cost | RTO/RPO | Complexity |
|---|---|---|---|
| Backup & Restore | Low | Hours/Days | Simple |
| Pilot Light | Medium | Minutes/Hours | Moderate |
| Warm Standby | High | Minutes | High |
| Active-Active | Very High | Near Zero | Very High |
Hot vs. Cold Storage Services
| Requirement | Service (Hot) | Service (Cold) |
|---|---|---|
| Latency | Milliseconds (DynamoDB/ElastiCache) | Minutes/Hours (Glacier) |
| Access Frequency | High (S3 Standard) | Low (S3 Glacier / IA) |
| Cost per GB | High | Very Low |
Muddy Points & Cross-Refs
- Availability vs. Resiliency: Many students confuse these. Think of availability as the result (Is it up?) and resiliency as the method (How do we keep it up when things break?).
- S3 Intelligent-Tiering: Unlike manual Lifecycle policies, this automatically moves data based on observed access patterns. Use this if your access patterns are unknown or unpredictable.
- Multi-AZ vs. Read Replicas: Multi-AZ is for HA (failover); Read Replicas are primarily for scaling performance, though Cross-Region Read Replicas can be part of a DR strategy.