Mastering Data Lifecycle: S3 Versioning and DynamoDB TTL

This guide covers critical data resiliency and lifecycle management techniques required for the AWS Certified Data Engineer – Associate (DEA-C01) exam. We focus on protecting object integrity in Amazon S3 and automating data expiration in Amazon DynamoDB.

Learning Objectives

After studying this guide, you should be able to:

  • Configure and manage Amazon S3 Versioning to protect against accidental data loss.
  • Implement DynamoDB Time to Live (TTL) to automate record expiration and reduce storage costs.
  • Design an archival pipeline that moves expired DynamoDB items to S3 using Kinesis and Firehose.
  • Distinguish between Enabled and Suspended versioning states in S3.

Key Terms & Glossary

  • Noncurrent Version: An older version of an S3 object that is retained after a new version is uploaded or the object is deleted.
  • Delete Marker: A placeholder in a versioned S3 bucket that indicates an object has been deleted, without actually removing the underlying data.
  • TTL Attribute: A specific attribute in a DynamoDB item (formatted as a Unix timestamp) that defines when the item is eligible for deletion.
  • Epoch Time: A system for describing a point in time, defined as the number of seconds that have elapsed since 00:00:00 UTC, January 1, 1970.
  • MFA Delete: An additional security layer requiring a hardware or software token to permanently delete an object version or change bucket versioning status.

The "Big Idea"

Data is not static. It has a lifecycle that moves from Ingestion to Active Use to Archive/Deletion. As a data engineer, your goal is to automate this flow so that high-value data is protected (S3 Versioning) while low-value, expired data is removed or archived to cheaper storage (DynamoDB TTL + S3) without manual intervention. This supports both compliance requirements (such as GDPR or HIPAA retention rules) and cost optimization.

Formula / Concept Box

| Feature | Core Requirement | Key Behavior |
| --- | --- | --- |
| S3 Versioning | Bucket-level setting | Cannot be disabled once enabled; only suspended. |
| DynamoDB TTL | Number (Attribute) | Must be in Unix Epoch format (seconds). |
| Archiving Filter | Lambda / Firehose | Filter by userIdentity.principalId: "dynamodb.amazonaws.com". |

Hierarchical Outline

  • I. Amazon S3 Versioning
    • A. Purpose: Data resiliency, accidental deletion protection, and regulatory compliance.
    • B. Operational States:
      • Unversioned: Default state; new uploads overwrite old ones.
      • Enabled: All versions kept; unique Version IDs assigned.
      • Suspended: Stops creating new versions; existing versions remain.
    • C. Security: MFA Delete prevents unauthorized permanent deletions.
  • II. DynamoDB Time to Live (TTL)
    • A. Functionality: Automatically deletes items based on a timestamp attribute.
    • B. Benefits: TTL deletions consume no write capacity and incur no extra charge; removing stale items reduces storage costs and keeps the table smaller.
    • C. Archival Workflow:
      • Stream Capture: TTL deletions are written to DynamoDB Streams (or Kinesis Data Streams for DynamoDB) as REMOVE records.
      • Firehose/Lambda: Separates TTL deletions from ordinary user deletes by checking each record's userIdentity field.
      • S3 Storage: Final long-term archival in JSON format.
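
The archival workflow in the outline needs two table settings before anything reaches S3: TTL must be enabled on an attribute, and a stream must carry the deleted item's contents. The following is a minimal sketch of those two settings as boto3 request arguments. The table name `sessions`, the attribute name `expires_at`, and the helper names are assumptions made for illustration, and the choice of `OLD_IMAGE` as the stream view type is one reasonable option (it includes the item as it was before deletion), not the only one.

```python
def ttl_kwargs(table_name: str, attribute_name: str) -> dict:
    """Arguments for the DynamoDB UpdateTimeToLive API call."""
    return {
        "TableName": table_name,
        "TimeToLiveSpecification": {
            "Enabled": True,
            "AttributeName": attribute_name,  # must hold epoch seconds as a Number
        },
    }


def stream_kwargs(table_name: str) -> dict:
    """Arguments for UpdateTable that enable a stream carrying the deleted item."""
    return {
        "TableName": table_name,
        "StreamSpecification": {
            "StreamEnabled": True,
            "StreamViewType": "OLD_IMAGE",  # the item as it looked before removal
        },
    }


def configure(table_name: str = "sessions", attribute_name: str = "expires_at") -> None:
    """Apply both settings. Needs AWS credentials; names are hypothetical."""
    import boto3  # imported lazily so the helpers above work without the SDK

    client = boto3.client("dynamodb")
    client.update_time_to_live(**ttl_kwargs(table_name, attribute_name))
    # Note: update_table rejects the request if a stream is already enabled.
    client.update_table(**stream_kwargs(table_name))
```

Keeping the request arguments in plain functions makes them easy to inspect or unit test without touching an AWS account.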

Visual Anchors

DynamoDB to S3 Archival Pipeline

[Diagram: DynamoDB table with TTL enabled → stream of TTL deletions → Lambda/Firehose filter → S3 archive]

S3 Versioning Stack Logic

\begin{tikzpicture}[node distance=1.5cm]
  \draw[thick] (0,0) rectangle (4,1) node[midway] {v3 (Current)};
  \draw[thick] (0,1.2) rectangle (4,2.2) node[midway] {v2 (Noncurrent)};
  \draw[thick] (0,2.4) rectangle (4,3.4) node[midway] {v1 (Noncurrent)};
  \draw[->, thick] (4.5, 0.5) -- (5.5, 0.5) node[right] {Latest PUT};
  \draw[dashed] (-1, -0.5) rectangle (5, 4);
  \node at (2, -0.8) {S3 Versioning Stack};
\end{tikzpicture}

Definition-Example Pairs

  • S3 Suspended State: A state where S3 stops generating new version IDs for new uploads.
    • Example: You enable versioning for a month, then realize storage costs are too high. You "suspend" it. Your old 1,000 versions are saved, but new uploads now have a version ID of null and will overwrite each other.
  • TTL Expiration: Automated deletion of stale data.
    • Example: A session tracking table stores login tokens. You set a TTL attribute to the epoch timestamp 24 hours after login. Once that time passes, the item becomes eligible for deletion and DynamoDB removes it in the background (not at that exact second), saving you from writing a custom cleanup script.
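
The session-token example can be sketched in a few lines. The code below computes a TTL value 24 hours after login, stores it as a DynamoDB Number in epoch seconds, and adds a client-side check, since an expired item can still be returned by reads until DynamoDB's background process removes it. The attribute names `session_id` and `expires_at` and all function names are assumptions for illustration.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def ttl_after(login_epoch: int, lifetime_seconds: int = SECONDS_PER_DAY) -> int:
    """Return the expiry time in whole epoch seconds, the unit TTL expects."""
    return int(login_epoch) + lifetime_seconds


def session_item(session_id: str, login_epoch: int) -> dict:
    """Build an item in DynamoDB's low-level attribute-value format."""
    return {
        "session_id": {"S": session_id},
        # Number values are sent as strings in the low-level format.
        "expires_at": {"N": str(ttl_after(login_epoch))},
    }


def is_live(item: dict, now: Optional[int] = None) -> bool:
    """Treat an item as expired once its TTL has passed, even if still stored."""
    now = int(time.time()) if now is None else now
    return int(item["expires_at"]["N"]) > now


item = session_item("abc123", login_epoch=1_700_000_000)
print(item["expires_at"]["N"])  # 1700086400
```

A common mistake is writing the timestamp in milliseconds; such a value points thousands of years into the future, so the item never becomes eligible for deletion in practice.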

Worked Examples

1. Enabling S3 Versioning via CLI

To enable versioning on a bucket named my-data-lake, use the following command:

```bash
aws s3api put-bucket-versioning \
  --bucket my-data-lake \
  --versioning-configuration Status=Enabled
```

[!IMPORTANT] After this, every DELETE request without a version ID only adds a Delete Marker. To truly remove data, you must specify the versionId in the delete call.
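
To make the delete-marker rule and the Enabled/Suspended states concrete, here is a toy in-memory model of a versioned bucket. It is a teaching sketch only, not how S3 is implemented: the class name, the `v1`, `v2` style IDs (real version IDs are opaque strings), and the method names are all assumptions.

```python
import itertools


class ToyVersionedBucket:
    """Simplified model of S3 versioning semantics for a single bucket."""

    def __init__(self):
        self.status = "Unversioned"
        self._versions = {}  # key -> list of (version_id, body), newest last
        self._ids = itertools.count(1)

    def set_versioning(self, status):
        # Once versioning is turned on it can only be suspended, never removed.
        if status not in ("Enabled", "Suspended"):
            raise ValueError("status must be 'Enabled' or 'Suspended'")
        self.status = status

    def _write(self, key, body):
        stack = self._versions.setdefault(key, [])
        if self.status == "Enabled":
            version_id = f"v{next(self._ids)}"
        else:
            # Unversioned or Suspended: the write replaces the 'null' version.
            version_id = "null"
            stack[:] = [v for v in stack if v[0] != "null"]
        stack.append((version_id, body))
        return version_id

    def put(self, key, body):
        return self._write(key, body)

    def delete(self, key, version_id=None):
        if version_id is None:
            if self.status == "Unversioned":
                self._versions.pop(key, None)
                return None
            return self._write(key, None)  # body None stands for a delete marker
        # Naming a version ID removes that version (or marker) permanently.
        stack = self._versions.get(key, [])
        stack[:] = [v for v in stack if v[0] != version_id]
        return version_id

    def get(self, key):
        stack = self._versions.get(key)
        if not stack or stack[-1][1] is None:
            raise KeyError(key)  # missing, or hidden behind a delete marker
        return stack[-1][1]

    def list_versions(self, key):
        return [version_id for version_id, _ in self._versions.get(key, [])]
```

In this model, a plain `delete` on an Enabled bucket hides the object behind a marker, deleting that marker by its ID brings the previous version back, and writes after suspension keep replacing the single `null` version while older versions stay in place.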

2. Identifying TTL Deletes in a Stream

When archiving DynamoDB data, you only want the items deleted by the system (TTL), not those deleted by users. In your Lambda function, check the userIdentity field:

```json
{
  "userIdentity": {
    "type": "Service",
    "principalId": "dynamodb.amazonaws.com"
  }
}
```

If the principalId matches the above, the event was a TTL expiration.
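
A minimal sketch of that check inside a Lambda function triggered by DynamoDB Streams might look like the following. It assumes the stream is configured to include the old item image, and it only returns the archived items as JSON strings; writing them to S3 or Firehose is left out. The function names and the sample event are illustrative assumptions.

```python
import json

TTL_IDENTITY = {"type": "Service", "principalId": "dynamodb.amazonaws.com"}


def is_ttl_delete(record: dict) -> bool:
    """True only for REMOVE events performed by the DynamoDB TTL process."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == TTL_IDENTITY["type"]
        and identity.get("principalId") == TTL_IDENTITY["principalId"]
    )


def lambda_handler(event, context):
    """Return one JSON line per TTL-expired item; user deletes are skipped."""
    return [
        json.dumps(record["dynamodb"].get("OldImage", {}), sort_keys=True)
        for record in event.get("Records", [])
        if is_ttl_delete(record)
    ]


sample_event = {
    "Records": [
        {   # expired by TTL: carries the service identity
            "eventName": "REMOVE",
            "userIdentity": TTL_IDENTITY,
            "dynamodb": {"OldImage": {"session_id": {"S": "abc123"}}},
        },
        {   # deleted by an application: no userIdentity field
            "eventName": "REMOVE",
            "dynamodb": {"OldImage": {"session_id": {"S": "manual"}}},
        },
        {   # ordinary write
            "eventName": "INSERT",
            "dynamodb": {"NewImage": {"session_id": {"S": "new"}}},
        },
    ]
}
print(lambda_handler(sample_event, None))
```

Checking `eventName` as well as `userIdentity` keeps the filter explicit about which events it archives.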

Checkpoint Questions

  1. What happens to existing versions of an object when you move an S3 bucket from "Enabled" to "Suspended" versioning?
  2. In DynamoDB, what format must the TTL attribute be in for the service to recognize it?
  3. True or False: DynamoDB TTL deletions consume write capacity units (WCU).
  4. How does S3 Versioning protect against accidental DELETE operations?

Comparison Tables

S3 Versioning vs. S3 Object Lock

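| Feature | S3 Versioning | S3 Object Lock |
| --- | --- | --- |
| Primary Goal | Recovery and history. | WORM (Write Once Read Many) compliance. |
| Deletion | A version can be permanently deleted by a permitted user who supplies its version ID. | A locked version cannot be deleted until its retention period expires (compliance mode) or its legal hold is removed; governance mode can be bypassed only with special permission. |
| Use Case | Accidental overwrite protection. | Legal holds and regulatory requirements. |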

Muddy Points & Cross-Refs

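  • TTL Latency: Items don't disappear the instant they expire. Deletion is a background process: AWS documentation has cited 48 hours as typical and more recently says "within a few days" of expiration. Expired items can still appear in reads, queries, and scans until they are removed, so filter on the TTL attribute and don't use TTL for logic that requires second-level precision.
  • Versioning vs. Backup: Versioning is not a replacement for a backup strategy (like AWS Backup). A bucket can be deleted once all of its versions and delete markers have been removed, and at that point every version is gone. MFA Delete makes that permanent removal harder, and cross-region replication keeps an independent copy in another bucket.
  • Storage Costs: Remember that versioning keeps every copy. If you upload a 1 GB file and overwrite it 10 times, you are paying for 11 GB of storage (the original plus ten overwrites). Use S3 Lifecycle policies to move noncurrent versions to S3 Glacier or to expire them.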
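
The storage arithmetic and the lifecycle remedy can be sketched together. The first helper counts the original upload plus every overwrite; the second builds a lifecycle configuration that moves noncurrent versions to S3 Glacier and later expires them. The rule ID, the 30-day and 365-day thresholds, and the function names are assumptions chosen for illustration, not recommended values.

```python
def versioned_storage_gb(object_size_gb: float, overwrites: int) -> float:
    """Total stored size when every overwrite keeps the prior version."""
    return object_size_gb * (overwrites + 1)  # +1 for the original upload


def noncurrent_lifecycle(transition_days: int = 30, expire_days: int = 365) -> dict:
    """Lifecycle configuration that tiers, then expires, noncurrent versions."""
    return {
        "Rules": [
            {
                "ID": "tier-and-expire-noncurrent",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies to all keys
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": transition_days, "StorageClass": "GLACIER"}
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": expire_days},
            }
        ]
    }


print(versioned_storage_gb(1, 10))  # 11
```

With boto3, a dictionary of this shape is what `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, alongside the `Bucket` name.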
