Mastering Data Lifecycle: S3 Versioning and DynamoDB TTL
This guide covers critical data resiliency and lifecycle management techniques required for the AWS Certified Data Engineer – Associate (DEA-C01) exam. We focus on protecting object integrity in Amazon S3 and automating data expiration in Amazon DynamoDB.
Learning Objectives
After studying this guide, you should be able to:
- Configure and manage Amazon S3 Versioning to protect against accidental data loss.
- Implement DynamoDB Time to Live (TTL) to automate record expiration and reduce storage costs.
- Design an archival pipeline that moves expired DynamoDB items to S3 using Kinesis and Firehose.
- Distinguish between Enabled and Suspended versioning states in S3.
Key Terms & Glossary
- Noncurrent Version: An older version of an S3 object that is retained after a new version is uploaded or the object is deleted.
- Delete Marker: A placeholder in a versioned S3 bucket that indicates an object has been deleted, without actually removing the underlying data.
- TTL Attribute: A specific attribute in a DynamoDB item (formatted as a Unix timestamp) that defines when the item is eligible for deletion.
- Epoch Time: A system for describing a point in time, defined as the number of seconds that have elapsed since 00:00:00 UTC, January 1, 1970.
- MFA Delete: An additional security layer requiring a hardware or software token to permanently delete an object version or change bucket versioning status.
The "Big Idea"
Data is not static. It has a lifecycle that moves from Ingestion to Active Use to Archive/Deletion. As a Data Engineer, your goal is to automate this flow so that high-value data is protected (S3 Versioning) while low-value, expired data is removed or archived to cheaper storage (DynamoDB TTL + S3) without manual intervention. This ensures both compliance (GDPR/HIPAA) and cost-optimization.
Formula / Concept Box
| Feature | Core Requirement | Key Behavior |
|---|---|---|
| S3 Versioning | Bucket-level setting | Cannot be disabled once enabled; only suspended. |
| DynamoDB TTL | Number (Attribute) | Must be in Unix Epoch format (seconds). |
| Archiving Filter | Lambda / Firehose | Filter by userIdentity.principalId: "dynamodb.amazonaws.com". |
Hierarchical Outline
- I. Amazon S3 Versioning
  - A. Purpose: Data resiliency, accidental deletion protection, and regulatory compliance.
  - B. Operational States:
    - Unversioned: Default state; new uploads overwrite old ones.
    - Enabled: All versions kept; unique Version IDs assigned.
    - Suspended: Stops creating new versions; existing versions remain.
  - C. Security: MFA Delete prevents unauthorized permanent deletions.
- II. DynamoDB Time to Live (TTL)
  - A. Functionality: Automatically deletes items based on a timestamp attribute.
  - B. Benefits: TTL deletions consume no write capacity and carry no extra charge; removing stale items reduces storage cost and keeps tables smaller.
  - C. Archival Workflow:
    - Stream capture: TTL deletions appear as REMOVE events in DynamoDB Streams or in Kinesis Data Streams for DynamoDB.
    - Firehose/Lambda: Separates TTL-specific events from standard user deletes.
    - S3 Storage: Final long-term archival in JSON format.
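The Firehose step of the archival workflow can be sketched as a transformation Lambda. This is a minimal sketch, not a definitive implementation: it assumes each incoming Firehose record is a base64-encoded JSON change record that carries `userIdentity` and `dynamodb.OldImage` fields, and the function name `transform` is hypothetical. Records from TTL are passed on as newline-delimited JSON, everything else is dropped.

```python
import base64
import json

TTL_PRINCIPAL = "dynamodb.amazonaws.com"


def transform(event, context=None):
    """Firehose transformation: keep TTL deletes, drop every other change record."""
    output = []
    for rec in event["records"]:
        try:
            change = json.loads(base64.b64decode(rec["data"]))
            if not isinstance(change, dict):
                raise ValueError("change record is not a JSON object")
        except ValueError:
            # Undecodable input is reported back to Firehose unchanged.
            output.append({"recordId": rec["recordId"],
                           "result": "ProcessingFailed",
                           "data": rec["data"]})
            continue

        identity = change.get("userIdentity") or {}
        if identity.get("principalId") == TTL_PRINCIPAL:
            # Assumed record shape: the expired item lives under dynamodb.OldImage.
            old_image = change.get("dynamodb", {}).get("OldImage", {})
            line = json.dumps(old_image) + "\n"  # one JSON document per line in S3
            output.append({"recordId": rec["recordId"],
                           "result": "Ok",
                           "data": base64.b64encode(line.encode()).decode()})
        else:
            # User-initiated deletes and all inserts/updates are not archived.
            output.append({"recordId": rec["recordId"],
                           "result": "Dropped",
                           "data": rec["data"]})
    return {"records": output}
```

Every input record gets exactly one output entry with the same `recordId`, which is what Firehose expects from a transformation function.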
Visual Anchors
DynamoDB to S3 Archival Pipeline
S3 Versioning Stack Logic
\begin{tikzpicture}[node distance=1.5cm]
  \draw[thick] (0,0) rectangle (4,1) node[midway] {v3 (Current)};
  \draw[thick] (0,1.2) rectangle (4,2.2) node[midway] {v2 (Noncurrent)};
  \draw[thick] (0,2.4) rectangle (4,3.4) node[midway] {v1 (Noncurrent)};
  \draw[->, thick] (4.5, 0.5) -- (5.5, 0.5) node[right] {Latest PUT};
  \draw[dashed] (-1, -0.5) rectangle (5, 4);
  \node at (2, -0.8) {S3 Versioning Stack};
\end{tikzpicture}
Definition-Example Pairs
- S3 Suspended State: A state where S3 stops generating new version IDs for new uploads.
  - Example: You enable versioning for a month, then realize storage costs are too high. You "suspend" it. Your old 1,000 versions are saved, but new uploads now have a version ID of `null` and will overwrite each other.
- TTL Expiration: Automated deletion of stale data.
  - Example: A session tracking table stores login tokens. You set a `TTL` attribute for 24 hours from login. DynamoDB deletes the record automatically some time after that timestamp passes (not at that exact second), saving you from writing a custom cleanup script.
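The session-token example can be sketched in Python as follows. The attribute name `expires_at`, the item shape, and both helper names are assumptions for illustration; the only firm requirement is that the TTL attribute is a Number holding epoch seconds. Because deletion is not instantaneous, the sketch also shows a read-time expiry check.

```python
import time

TTL_SECONDS = 24 * 60 * 60  # 24 hours, as in the session example


def build_session_item(user_id, token, login_epoch=None):
    """Return a session item whose TTL attribute is in epoch seconds, not milliseconds."""
    if login_epoch is None:
        login_epoch = int(time.time())
    return {
        "user_id": user_id,                            # hypothetical partition key
        "token": token,
        "expires_at": int(login_epoch) + TTL_SECONDS,  # hypothetical TTL attribute name
    }


def is_expired(item, now_epoch=None):
    """True once the TTL has passed, whether or not DynamoDB has removed the item yet."""
    if now_epoch is None:
        now_epoch = int(time.time())
    return item["expires_at"] <= now_epoch


item = build_session_item("u-123", "abc", login_epoch=1_700_000_000)
print(item["expires_at"])               # 1700086400
print(is_expired(item, 1_700_000_000))  # False
print(is_expired(item, 1_700_090_000))  # True
```

Storing the value with `int(...)` guards against the common mistake of writing a float or a millisecond timestamp, which the TTL process would not treat as the intended expiry time.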
Worked Examples
1. Enabling S3 Versioning via CLI
To enable versioning on a bucket named my-data-lake, use the following command:

    aws s3api put-bucket-versioning \
        --bucket my-data-lake \
        --versioning-configuration Status=Enabled

> [!IMPORTANT]
> After this, every `DELETE` request without a version ID only adds a Delete Marker. To truly remove data, you must specify the `versionId` in the delete call.
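The delete-marker behavior can be made concrete with a toy in-memory model. This is a sketch for building intuition only: `VersionedBucketModel` is a hypothetical class, not an AWS client, and the `v1`, `v2` style IDs stand in for the opaque version IDs that S3 generates.

```python
import itertools


class VersionedBucketModel:
    """Toy in-memory model of S3 versioning semantics (not an AWS client)."""

    def __init__(self):
        self.status = "Enabled"   # switch to "Suspended" to see null version IDs
        self.versions = {}        # key -> list of version dicts, newest last
        self._ids = itertools.count(1)

    def _add(self, key, body, is_delete_marker):
        stack = self.versions.setdefault(key, [])
        if self.status == "Enabled":
            version_id = f"v{next(self._ids)}"
        else:
            # Suspended: new writes get the "null" version ID and replace any
            # existing null version; older real versions are kept.
            version_id = "null"
            stack[:] = [v for v in stack if v["id"] != "null"]
        stack.append({"id": version_id, "body": body,
                      "delete_marker": is_delete_marker})
        return version_id

    def put(self, key, body):
        return self._add(key, body, False)

    def delete(self, key, version_id=None):
        if version_id is None:
            # Simple DELETE: nothing is removed, a delete marker becomes current.
            return self._add(key, None, True)
        # DELETE with a version ID permanently removes that one version.
        stack = self.versions.get(key, [])
        self.versions[key] = [v for v in stack if v["id"] != version_id]
        return version_id

    def get(self, key):
        stack = self.versions.get(key, [])
        if not stack or stack[-1]["delete_marker"]:
            return None  # S3 would answer 404 Not Found
        return stack[-1]["body"]


bucket = VersionedBucketModel()
bucket.put("report.csv", "first")
bucket.put("report.csv", "second")
marker_id = bucket.delete("report.csv")          # adds a delete marker only
print(bucket.get("report.csv"))                  # None
bucket.delete("report.csv", version_id=marker_id)
print(bucket.get("report.csv"))                  # second
```

Removing the delete marker by its version ID makes the previous version current again, which is how an accidental delete is undone in a versioned bucket.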
2. Identifying TTL Deletes in a Stream
When archiving DynamoDB data, you only want the items deleted by the system (TTL), not those deleted by users. In your Lambda function, check the userIdentity field:

    {
      "userIdentity": {
        "type": "Service",
        "principalId": "dynamodb.amazonaws.com"
      }
    }

If the principalId matches the above, the event was a TTL expiration.
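A minimal sketch of that check inside a Lambda function attached to a DynamoDB stream might look like this. The function names `is_ttl_delete` and `handler` are hypothetical, and the sketch assumes the stream view type includes old images so that `OldImage` is available.

```python
TTL_PRINCIPAL = "dynamodb.amazonaws.com"


def is_ttl_delete(record):
    """True only for REMOVE events performed by the DynamoDB TTL process."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == TTL_PRINCIPAL
    )


def handler(event, context=None):
    """Collect the old images of TTL-expired items so a later step can archive them."""
    expired = []
    for record in event.get("Records", []):
        if is_ttl_delete(record):
            # OldImage is present only when the stream captures old images.
            expired.append(record.get("dynamodb", {}).get("OldImage", {}))
    return expired


sample_event = {
    "Records": [
        {   # TTL expiration: carries the service identity
            "eventName": "REMOVE",
            "userIdentity": {"type": "Service", "principalId": TTL_PRINCIPAL},
            "dynamodb": {"OldImage": {"session_id": {"S": "s-1"}}},
        },
        {   # user-initiated delete: no userIdentity field
            "eventName": "REMOVE",
            "dynamodb": {"OldImage": {"session_id": {"S": "s-2"}}},
        },
    ]
}
print(handler(sample_event))  # [{'session_id': {'S': 's-1'}}]
```

Checking `eventName` as well as `principalId` keeps the filter from matching anything other than deletions.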
Checkpoint Questions
- What happens to existing versions of an object when you move an S3 bucket from "Enabled" to "Suspended" versioning?
- In DynamoDB, what format must the TTL attribute be in for the service to recognize it?
- True or False: DynamoDB TTL deletions consume write capacity units (WCU).
- How does S3 Versioning protect against accidental `DELETE` operations?
Comparison Tables
S3 Versioning vs. S3 Object Lock
| Feature | S3 Versioning | S3 Object Lock |
|---|---|---|
| Primary Goal | Recovery and history. | WORM (Write Once Read Many) compliance. |
| Deletion | Can delete if you have the version ID. | Cannot delete until retention period expires. |
| Use Case | Accidental overwrite protection. | Legal holds and regulatory requirements. |
Muddy Points & Cross-Refs
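- TTL Latency: Items don't disappear the exact millisecond they expire. DynamoDB deletes them in the background, typically within a few days of expiration (older AWS documentation cited 48 hours). Until then, expired items still show up in reads, queries, and scans, so filter on the TTL attribute if they must be hidden. Don't use TTL for real-time logic that requires second-level precision.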
- Versioning vs. Backup: Versioning is not a replacement for a backup strategy (like AWS Backup). If the bucket is emptied and deleted, every version goes with it. MFA Delete makes permanent deletion harder to do by accident, and cross-region replication keeps a copy in a separate bucket.
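- Storage Costs: Remember that versioning keeps every copy. If you upload a 1 GB file and overwrite it 10 times, you are paying for 11 GB of storage (the original plus 10 overwrites). Use S3 Lifecycle Policies to move noncurrent versions to S3 Glacier or to expire them after a set number of days.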
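The storage arithmetic and a matching lifecycle rule can be sketched as below. The helper name, the rule ID, and the 30-day and 365-day thresholds are assumptions chosen for illustration; the dictionary follows the shape of S3's PutBucketLifecycleConfiguration request and would need to be passed to an S3 client to take effect.

```python
import json


def versioned_storage_gb(object_gb, overwrites):
    """Storage billed when every overwrite keeps the previous copy as a noncurrent version."""
    return object_gb * (overwrites + 1)  # the original upload plus each overwrite


# Lifecycle rule that tiers noncurrent versions, then expires them.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-then-expire-noncurrent",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},             # empty prefix: whole bucket
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }
    ]
}

print(versioned_storage_gb(1, 10))  # 11
print(json.dumps(lifecycle_configuration, indent=2))
```

Only noncurrent versions are affected by this rule, so the current version of each object stays in its original storage class.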