AWS Infrastructure as Code (IaC) for Data Engineering

This guide covers the core principles and AWS services used to manage and provision cloud infrastructure through code, ensuring repeatable and consistent deployments for data pipelines.

Learning Objectives

Define the core concepts of Infrastructure as Code (IaC) and its benefits for data engineering.
Compare and contrast AWS CloudFormation, AWS Cloud Development Kit (CDK), and AWS Serverless Application Model (SAM).
Explain the role of version control and CI/CD in infrastructure management.
Apply IaC tools to deploy repeatable resources such as Glue jobs and Lambda functions.

Key Terms & Glossary

Infrastructure as Code (IaC): The practice of managing and provisioning infrastructure using machine-readable definition files rather than manual processes.
Construct: The basic building block of AWS CDK apps, representing a single resource or a higher-level abstraction of multiple resources.
Template: A JSON or YAML file that describes the intended state of the AWS infrastructure for CloudFormation.
Stack: A collection of AWS resources that you can manage as a single unit in CloudFormation.
Configuration Drift: When the actual state of resources in the cloud deviates from the defined state in the IaC template.
Synthesis (Synth): The process by which AWS CDK code is converted into a CloudFormation template.
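Configuration drift is easiest to picture as a difference between two dictionaries: what the template declares and what actually exists. The sketch below is a conceptual illustration in plain Python, not a call to CloudFormation's drift detection feature. The function name `find_drift` and the sample property values are hypothetical.

```python
def find_drift(declared: dict, actual: dict) -> dict:
    """Return {property: (declared_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical properties of a Glue job as written in the template...
declared = {"Name": "data-ingestion-job", "Timeout": 60, "MaxRetries": 1}
# ...and as found in the account after a manual edit in the console.
actual = {"Name": "data-ingestion-job", "Timeout": 120, "MaxRetries": 1}

print(find_drift(declared, actual))  # {'Timeout': (60, 120)}
```

An empty result means the deployed resource still matches its definition.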

The "Big Idea"

[!IMPORTANT] Infrastructure is Code. In modern data engineering, your infrastructure (servers, databases, ETL jobs) should be treated like your application code: versioned, reviewed, tested, and deployed automatically. This greatly reduces the "it works on my machine" problem and keeps the production environment structurally consistent with development, apart from deliberate differences such as capacity settings or account IDs.

Formula / Concept Box

Disaster Recovery Metrics

Metric	Definition	Goal
RPO (Recovery Point Objective)	Maximum acceptable data loss (time-based).	Minimize data loss.
RTO (Recovery Time Objective)	Maximum acceptable downtime for the system.	Minimize system unavailability.
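As a worked check of these two metrics, the sketch below assumes that worst-case data loss equals the backup interval and that downtime equals the time to redeploy infrastructure plus the time to restore data. Both are simplifying assumptions, and the function names and sample durations are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case, a failure happens just before the next backup,
    # so up to one full interval of data is lost.
    return backup_interval <= rpo


def meets_rto(redeploy_time: timedelta, restore_time: timedelta, rto: timedelta) -> bool:
    # Downtime is modeled as redeploying the stack, then restoring data.
    return redeploy_time + restore_time <= rto


rpo_ok = meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4))
rto_ok = meets_rto(timedelta(minutes=20), timedelta(minutes=50), rto=timedelta(hours=1))
print(rpo_ok, rto_ok)  # True False
```

In this model, redeploying from IaC templates shortens only the redeploy portion of the RTO; the data restore time still has to be addressed separately.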

Hierarchical Outline

Infrastructure as Code Core Concepts
- Consistency: Identical environments across Dev/Test/Prod.
- Automation: Integrated into CI/CD pipelines.
- Documentation: The code is the documentation of the architecture.
AWS CloudFormation (The Foundation)
- Declarative (YAML/JSON) templates.
- Manages resource lifecycle (Create, Update, Delete).
- Handles dependencies between resources automatically.
AWS Cloud Development Kit (CDK)
- Imperative programming (Python, TypeScript, Java).
- Higher-level abstractions (Constructs).
- Synthesizes into CloudFormation templates.
AWS Serverless Application Model (SAM)
- Extension of CloudFormation for serverless workloads.
- Shorthand syntax for Lambda, API Gateway, and DynamoDB.
CI/CD for IaC
- AWS CodeCommit: Version control for IaC files.
- AWS CodeBuild: Automated testing and validation.
- AWS CodePipeline: Automated deployment orchestration.
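The outline notes that CloudFormation handles dependencies between resources automatically. One way to picture this, as a simplified sketch and not CloudFormation's actual implementation, is to collect every `Ref` inside a resource's properties and sort the resources topologically. The template below is hypothetical, and only `Ref` is handled here (real templates also express dependencies through `Fn::GetAtt`, `DependsOn`, and other mechanisms).

```python
from graphlib import TopologicalSorter


def find_refs(value, resource_names):
    """Collect logical IDs referenced via {"Ref": ...} anywhere in value."""
    refs = set()
    if isinstance(value, dict):
        target = value.get("Ref")
        if isinstance(target, str) and target in resource_names:
            refs.add(target)
        for child in value.values():
            refs |= find_refs(child, resource_names)
    elif isinstance(value, list):
        for child in value:
            refs |= find_refs(child, resource_names)
    return refs


def creation_order(template):
    resources = template["Resources"]
    names = set(resources)
    # Map each resource to the set of resources it depends on.
    graph = {
        name: find_refs(body.get("Properties", {}), names)
        for name, body in resources.items()
    }
    # TopologicalSorter yields dependencies before their dependents.
    return list(TopologicalSorter(graph).static_order())


template = {
    "Resources": {
        "MyGlueJob": {
            "Type": "AWS::Glue::Job",
            "Properties": {
                "Role": {"Ref": "GlueJobRole"},
                "Command": {"Name": "glueetl"},
            },
        },
        "GlueJobRole": {"Type": "AWS::IAM::Role", "Properties": {}},
    }
}

print(creation_order(template))  # ['GlueJobRole', 'MyGlueJob']
```

The role is created first because the job refers to it, even though the job appears first in the template.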

Visual Anchors

Deployment Pipeline Flow


CDK Abstraction Layers


Definition-Example Pairs

Declarative vs. Imperative:
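- Definition: Declarative defines "what" to build (CloudFormation); imperative uses programming logic such as loops and conditionals to describe "how" to build it (CDK). CDK code still synthesizes into a declarative CloudFormation template.
- Example: A CloudFormation template declares a single bucket. A CDK script uses a for loop to create 10 buckets based on a list of names.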
Reusability:
- Definition: The ability to use the same code logic to deploy identical stacks in different regions or accounts.
- Example: Using a single CloudFormation template to deploy the same data lake in us-east-1 and eu-west-1, for instance to help meet data-residency requirements in each region.
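The loop in the declarative vs. imperative example can be imitated without CDK installed. The sketch below uses plain Python to build a CloudFormation-style template as a dictionary and print it as JSON. It is only an analogy for what happens during synthesis: real CDK code would create `s3.Bucket` constructs and rely on `cdk synth` to produce the template. The dataset names and the `build_template` helper are hypothetical, and three names are used instead of ten to keep the output short.

```python
import json

# Hypothetical dataset names; a real project might read these from config.
dataset_names = ["orders", "customers", "clickstream"]


def build_template(names):
    resources = {}
    for name in names:
        # Logical IDs must be alphanumeric, e.g. "OrdersBucket".
        logical_id = f"{name.capitalize()}Bucket"
        resources[logical_id] = {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


template = build_template(dataset_names)
print(json.dumps(template, indent=2))
```

Adding a fourth bucket means adding one string to the list, whereas a hand-written template would need another copy of the resource block.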

Worked Examples

Example 1: CloudFormation (YAML) for a Glue Job

This template defines a simple Glue ETL job.

yaml

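Resources:
  MyGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Name: data-ingestion-job
      # GlueJobRole must be an IAM role resource or parameter defined
      # elsewhere in this template (not shown here)
      Role: !Ref GlueJobRole
      Command:
        Name: glueetl  # Spark ETL job type
        # The script must already exist at this S3 location
        ScriptLocation: s3://my-scripts-bucket/ingest.py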
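
Before a template like this is deployed, a CI step could run a basic structural check. The sketch below assumes the template has already been parsed into a dictionary (the JSON form is used, since parsing YAML with `!Ref` tags needs a third-party loader). It checks only `Role` and `Command`, the two properties CloudFormation requires for `AWS::Glue::Job`. The function name `lint_glue_jobs` is hypothetical, and this is not a substitute for tools such as `aws cloudformation validate-template` or cfn-lint.

```python
REQUIRED_GLUE_JOB_PROPERTIES = ("Role", "Command")


def lint_glue_jobs(template: dict) -> list:
    """Return a list of human-readable problems; empty means no findings."""
    problems = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Glue::Job":
            continue
        properties = resource.get("Properties", {})
        for prop in REQUIRED_GLUE_JOB_PROPERTIES:
            if prop not in properties:
                problems.append(f"{logical_id}: missing required property {prop}")
    return problems


good = {
    "Resources": {
        "MyGlueJob": {
            "Type": "AWS::Glue::Job",
            "Properties": {
                "Name": "data-ingestion-job",
                "Role": {"Ref": "GlueJobRole"},
                "Command": {
                    "Name": "glueetl",
                    "ScriptLocation": "s3://my-scripts-bucket/ingest.py",
                },
            },
        }
    }
}
# Same job with the Role property removed, to show a finding.
bad = {"Resources": {"MyGlueJob": {"Type": "AWS::Glue::Job",
                                   "Properties": {"Command": {"Name": "glueetl"}}}}}

print(lint_glue_jobs(good))  # []
print(lint_glue_jobs(bad))   # ['MyGlueJob: missing required property Role']
```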

Example 2: AWS CDK (Python) for Step Functions

Notice how the CDK uses standard programming classes.

python

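from aws_cdk import Stack, aws_stepfunctions as sfn
from constructs import Construct


class DataPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Define a simple Pass state
        start = sfn.Pass(self, "StartNode")

        # Create the state machine; definition_body replaces the
        # deprecated definition argument in recent CDK v2 releases
        sfn.StateMachine(
            self,
            "MyPipeline",
            definition_body=sfn.DefinitionBody.from_chainable(start),
        )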
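
For orientation, the state machine above corresponds roughly to the Amazon States Language document below, built here as a plain Python dictionary so it can be printed without CDK installed. This is a hand-written approximation: the exact text that CDK synthesizes may differ in formatting or include additional fields.

```python
import json

# Amazon States Language for a single Pass state that ends the execution.
# "StartNode" mirrors the construct ID used in the CDK example.
definition = {
    "StartAt": "StartNode",
    "States": {
        "StartNode": {"Type": "Pass", "End": True},
    },
}

print(json.dumps(definition, indent=2))
```

Seeing the generated definition helps when a deployment error in the CloudFormation console refers to the state machine's definition rather than to the Python code.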

Checkpoint Questions

What is the primary difference between how you define infrastructure in CloudFormation vs. AWS CDK?
Which service is best suited for a team that only needs to deploy serverless Lambda functions and DynamoDB tables?
Why is version control (like AWS CodeCommit) essential for IaC?
What happens during the "Synthesis" phase of the AWS CDK lifecycle?

Comparison Tables

Aspect	AWS SAM	AWS CDK	CloudFormation
Primary Language	YAML / JSON	Python, TypeScript, Java, etc.	YAML / JSON
Abstraction	High (Serverless focused)	High (Custom constructs)	Low (Raw resources)
Best For	Simple serverless apps	Complex cloud architectures	Traditional infrastructure
Learning Curve	Lower (Simple syntax)	Higher (Needs coding skill)	Medium (Complex templates)
Under the Hood	Extends CloudFormation	Generates CloudFormation	Direct AWS Service

Muddy Points & Cross-Refs

SAM vs. CDK: It can be confusing which to pick for serverless. Tip: Use SAM if you want simple, fast templates specifically for Lambda. Use CDK if your serverless app is part of a much larger, complex infrastructure stack (e.g., VPCs, RDS, Redshift).
The CDK "Double-Step": Remember that CDK is not a separate deployment engine; it is a wrapper. You still need to understand CloudFormation because errors during deployment will often be reported in the CloudFormation console.
Cross-Refs: For more on CI/CD automation, refer to the "Data Operations" chapter. For resource-specific syntax, see the "Glue and Lambda" deep dives.