AWS Certified Data Engineer: Protecting Data with Resiliency and Availability
Protect data with appropriate resiliency and availability
This guide focuses on the critical task of ensuring data durability, availability, and cost-efficient protection within the AWS ecosystem, covering both high availability (HA) and disaster recovery (DR) strategies.
Learning Objectives
By the end of this study guide, you will be able to:
- Define and calculate RPO (Recovery Point Objective) and RTO (Recovery Time Objective).
- Differentiate between High Availability and Disaster Recovery architectures.
- Implement S3 Lifecycle policies for automatic tiering and data deletion.
- Configure Cross-Region Replication (CRR) and Multi-AZ deployments for resiliency.
- Utilize Amazon Macie, AWS Config, and S3 Versioning for data governance.
Key Terms & Glossary
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., losing 1 hour of data).
- RTO (Recovery Time Objective): The target time to restore a service after a disruption (e.g., service back online in 4 hours).
- Resiliency: The ability of a workload to respond to and recover from failures.
- Availability: The percentage of time that a workload is operational (e.g., "four nines" or 99.99%).
- CRR (Cross-Region Replication): Automatically copying S3 objects across different AWS Regions for geographic redundancy.
- WORM (Write Once, Read Many): A data storage model ensured by S3 Object Lock to prevent modification or deletion.
- TTL (Time to Live): A mechanism in DynamoDB to automatically expire items from a table to reduce storage costs.
The "Big Idea"
Data protection is not a "one size fits all" task. It is a strategic balance between Business Continuity and Cost Optimization. As a Data Engineer, you must classify data into tiers (Hot vs. Cold) and apply the appropriate level of resiliency (Multi-AZ vs. Multi-Region) based on the business's tolerance for downtime (RTO) and data loss (RPO). Priority Zero is Security, but Priority One is ensuring the data exists and is accessible when the business needs it.
Formula / Concept Box
Disaster Recovery Metrics
| Metric | Definition | Focus |
|---|---|---|
| RPO | Maximum acceptable data loss, measured in time | Data integrity / loss prevention |
| RTO | Target time to restore a service after a disruption | Service uptime / availability |
[!TIP] Pro Tip: Smaller RPO/RTO targets mean less data loss and faster recovery, but they drive significantly higher infrastructure costs.
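The relationship between an availability target and allowed downtime is simple arithmetic. A minimal sketch (the function name is ours, not an AWS API):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def max_annual_downtime_minutes(availability_pct: float) -> float:
    """Maximum downtime per year (in minutes) permitted by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100) / 60

# "Four nines" (99.99%) allows roughly 52.6 minutes of downtime per year;
# "three nines" (99.9%) allows roughly 525.6 minutes (about 8.8 hours).
print(round(max_annual_downtime_minutes(99.99), 1))
print(round(max_annual_downtime_minutes(99.9), 1))
```

This is why each additional "nine" costs so much: the allowed failure window shrinks by a factor of ten.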
Hierarchical Outline
- Foundational Resilience Concepts
- Recovery Metrics: Establishing RPO and RTO with stakeholders.
- High Availability (HA): Using Multi-AZ for RDS, EBS, and MSK.
- Disaster Recovery (DR): Regional failure protection using CRR and snapshots.
- Storage Resiliency Strategies
- Amazon S3: Versioning (accidental deletion), Object Lock (compliance), and Replication.
- Amazon EBS/RDS: Multi-AZ deployments for automated failover.
- Amazon Redshift: Cross-Region snapshots for regional recovery.
- Data Lifecycle Management (DLM)
- Automation: Using S3 Lifecycle Policies to transition data (Standard -> IA -> Glacier).
- Archiving: Long-term storage in S3 Glacier for legal/compliance needs.
- Cleanup: DynamoDB TTL and automated S3 expiration to minimize costs.
- Security & Governance
- Discovery: Amazon Macie for identifying PII in S3.
- Enforcement: AWS Config rules to ensure deletion and encryption policies are met.
- Protection: AWS Shield for DDoS and AWS Backup for centralized management.
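As a concrete example of the enforcement bullet above, AWS Config can continuously check that every S3 bucket has versioning enabled via the managed rule `S3_BUCKET_VERSIONING_ENABLED`. Below is a sketch of the rule payload (the rule name is a hypothetical choice):

```python
# Payload for an AWS Config managed rule that flags S3 buckets
# without versioning enabled.
versioning_rule = {
    "ConfigRuleName": "s3-versioning-enabled",  # hypothetical name
    "Source": {
        "Owner": "AWS",  # managed (AWS-authored) rule
        "SourceIdentifier": "S3_BUCKET_VERSIONING_ENABLED",
    },
    "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
}

# With boto3 this would be applied via:
#   boto3.client("config").put_config_rule(ConfigRule=versioning_rule)
print(versioning_rule["Source"]["SourceIdentifier"])
```

Non-compliant buckets then surface in the Config dashboard without any custom Lambda code.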
Visual Anchors
- S3 Lifecycle Flow
- Resiliency Architecture Patterns
Definition-Example Pairs
- S3 Versioning: Storing multiple iterations of the same object.
- Example: If a script accidentally deletes sales_data.csv, versioning allows you to restore the previous version instantly.
- S3 Object Lock: Prevents an object from being deleted or overwritten for a fixed amount of time.
- Example: A financial firm must keep records for 7 years per SEC rules; Object Lock ensures no admin can delete them prematurely.
- DynamoDB TTL: Automatically deleting items based on a timestamp attribute.
- Example: Deleting temporary session data for a web app after 24 hours of inactivity to keep the table size (and cost) lean.
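DynamoDB TTL works by comparing an item attribute holding an epoch-seconds timestamp against the current time. A hedged sketch of the session example above (table name, attribute name, and 24-hour policy are illustrative):

```python
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 3600  # hypothetical policy: expire sessions after 24h

def session_item(session_id: str, now: Optional[int] = None) -> dict:
    """Build a DynamoDB item whose 'expires_at' attribute drives TTL deletion."""
    now = int(time.time()) if now is None else now
    return {
        "session_id": {"S": session_id},
        "expires_at": {"N": str(now + SESSION_TTL_SECONDS)},  # epoch seconds
    }

# TTL itself is enabled once per table, e.g. with boto3:
#   boto3.client("dynamodb").update_time_to_live(
#       TableName="sessions",
#       TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"})
```

Note that TTL deletion is best-effort and can lag the timestamp by some time, so queries should still filter out expired items.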
Worked Examples
Scenario 1: Determining the DR Strategy
Problem: A healthcare company requires an architecture where data is replicated in real time to a second region. They can tolerate almost zero data loss (RPO < 1 min) and need to be back online within 10 minutes of a regional failure.
Solution:
- Architecture: Active-Active or Warm Standby (Active-Passive).
- S3: Enable Cross-Region Replication (CRR).
- RDS: Use Cross-Region Read Replicas or Aurora Global Database.
- Result: Low RPO via continuous replication; Low RTO via pre-provisioned resources in the second region.
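The S3 half of this solution can be sketched as a replication configuration with S3 Replication Time Control (RTC), which adds a 15-minute replication SLA to support the low-RPO requirement. Bucket names and the IAM role ARN below are hypothetical:

```python
# Cross-Region Replication configuration for put_bucket_replication,
# with Replication Time Control (RTC) enabled.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/crr-replication-role",  # hypothetical
    "Rules": [
        {
            "ID": "replicate-all-to-dr-region",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter -> replicate the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::dr-region-bucket",  # hypothetical
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
            },
        }
    ],
}

# Applied with: boto3.client("s3").put_bucket_replication(
#     Bucket="primary-bucket", ReplicationConfiguration=replication_config)
```

Versioning must be enabled on both source and destination buckets before replication can be configured.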
Scenario 2: S3 Lifecycle Policy Configuration
Problem: You have logs that are accessed daily for 30 days, then rarely accessed for the next year, and must be kept for 5 years for compliance.
Solution:
- Transition 1: After 30 days, move from S3 Standard to S3 Standard-IA.
- Transition 2: After 365 days, move to S3 Glacier Deep Archive.
- Expiration: After 1,825 days (5 years), delete the object.
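The three steps above map directly onto a single S3 lifecycle rule. A sketch of the configuration (the `logs/` prefix and rule ID are hypothetical):

```python
# Lifecycle rule implementing Scenario 2:
# 30 days -> Standard-IA, 365 days -> Deep Archive, delete at 5 years.
lifecycle_config = {
    "Rules": [
        {
            "ID": "log-retention",  # hypothetical name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 1825},  # 5 years
        }
    ],
}

# Applied with: boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="log-bucket", LifecycleConfiguration=lifecycle_config)
```

Each transition's `Days` value counts from object creation, not from the previous transition, which is why the second transition uses 365 rather than 335.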
Checkpoint Questions
- What is the main difference between S3 Cross-Region Replication and S3 Versioning regarding data protection?
- Which service would you use to discover sensitive PII data before moving it to an archive?
- True or False: Serverless analytics solutions like AWS Glue have built-in high availability.
- If a company can tolerate losing 24 hours of data, what is their RPO?
Comparison Tables
Disaster Recovery Architecture Comparison
| Strategy | Cost | RTO/RPO | Complexity |
|---|---|---|---|
| Backup & Restore | Low | Hours/Days | Simple |
| Pilot Light | Medium | Minutes/Hours | Moderate |
| Warm Standby | High | Minutes | High |
| Active-Active | Very High | Near Zero | Very High |
Hot vs. Cold Storage Services
| Requirement | Service (Hot) | Service (Cold) |
|---|---|---|
| Latency | Milliseconds (DynamoDB/ElastiCache) | Minutes/Hours (Glacier) |
| Access Frequency | High (S3 Standard) | Low (S3 Glacier / IA) |
| Cost per GB | High | Very Low |
Muddy Points & Cross-Refs
- Availability vs. Resiliency: Many students confuse these. Think of availability as the result (Is it up?) and resiliency as the method (How do we keep it up when things break?).
- S3 Intelligent-Tiering: Unlike manual Lifecycle policies, this automatically moves data based on observed access patterns. Use this if your access patterns are unknown or unpredictable.
- Multi-AZ vs. Read Replicas: Multi-AZ is for HA (failover); Read Replicas are primarily for scaling performance, though Cross-Region Read Replicas can be part of a DR strategy.