AWS Study Guide: Provisioned vs. Serverless Services

This guide explores the architectural and operational tradeoffs between provisioned infrastructure (like EC2) and serverless paradigms (like Lambda and Athena) within the AWS ecosystem, specifically for data engineering workloads.

Learning Objectives

After studying this guide, you should be able to:

Distinguish between provisioned, managed, and serverless service models.
Analyze cost implications of pay-per-use vs. hourly instance billing.
Evaluate operational overhead differences including patching, scaling, and maintenance.
Identify appropriate use cases for serverless tools (Lambda, Athena, Glue) versus provisioned tools (EC2, Redshift clusters).

Key Terms & Glossary

Provisioned Service: A model where you specify the capacity (CPU, RAM, Storage) in advance, often paying for the resource as long as it is "running," regardless of actual usage.
Serverless: A cloud execution model where the provider dynamically manages the allocation of machine resources. Users are only aware of their code or queries and the events that trigger them.
Elasticity: The ability to scale resources up and down to meet changing demand automatically.
Statelessness: A design principle where each request or execution is independent and does not store data locally between runs (critical for Lambda).
Cold Start: The delay encountered when a serverless function is invoked for the first time, or after a period of inactivity, while its execution environment is being initialized.

The "Big Idea"

In modern data engineering, the shift from Provisioned to Serverless is a shift from Infrastructure Management to Business Logic. Instead of spending time patching Linux kernels or rightsizing instance types, engineers focus on writing the transformations (Python/SQL) and orchestrating workflows. However, this convenience comes at the cost of less granular control over the underlying environment and potential performance variance.

Formula / Concept Box

Concept	Formula / Rule	Notes
Provisioned Cost	$Cost = (Instance\ Rate \times Hours) + Storage$	Fixed regardless of traffic.
Serverless Cost	$Cost = (Executions \times Duration \times Memory) + Data\ Scanned$	Scales linearly with actual work.
Lambda Timeout	$Max\ Duration = 15\ minutes$	Hard limit; use Step Functions for longer tasks.
Scaling Rule	$Serverless \approx Vertical + Horizontal\ Auto$	Managed by the cloud provider; for Lambda, you still choose the per-function memory setting.
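
The 15-minute limit in the table means a long job has to be split into pieces. One possible pattern, sketched below, is for the function to check how much time it has left and return a cursor so that an orchestrator (such as a Step Functions loop) can invoke it again. `get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object; the event shape (`items`, `cursor`), `process_item`, and `SAFETY_MARGIN_MS` are hypothetical names chosen for illustration.

```python
SAFETY_MARGIN_MS = 10_000  # hypothetical buffer: stop well before the timeout


def process_item(item):
    """Placeholder for real work (hypothetical); here it just upper-cases."""
    return str(item).upper()


def handler(event, context):
    """Process items from event["cursor"] onward until time runs low.

    Returns the next cursor and a done flag so an orchestrator
    (for example a Step Functions loop) can invoke the function again.
    """
    items = event["items"]
    cursor = event.get("cursor", 0)
    results = []
    while cursor < len(items):
        # get_remaining_time_in_millis() is provided by the Lambda context.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break
        results.append(process_item(items[cursor]))
        cursor += 1
    return {"cursor": cursor, "done": cursor >= len(items), "results": results}
```

When `done` is false, the caller passes the returned `cursor` back in the next invocation, so no single run has to finish the whole job.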

Hierarchical Outline

Provisioned Infrastructure (IaaS)
- EC2 & RDS: Full control over OS; manual or auto-scaling configuration required.
- Cost Model: Predictable for steady-state; expensive for idle time.
- Overhead: Requires patching, security updates, and software installation.
Serverless Computing (FaaS/Managed)
- AWS Lambda: Event-driven execution (e.g., S3 upload triggers a script).
- Amazon Athena: Querying S3 data directly using SQL without a database cluster.
- Operational Benefits: No server maintenance; automatic high availability.
Decision Tradeoffs
- Performance: Provisioned offers consistent low latency; Serverless can suffer from "cold starts."
- Duration: Serverless has execution limits; provisioned can run indefinitely.
- State: Serverless is stateless; local disk is ephemeral.
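
The tradeoffs in this outline can be read as a rough decision procedure. The sketch below encodes them as simple rules. The `Workload` fields, the order of the checks, and the returned labels are assumptions made for illustration, not an official AWS decision tree.

```python
from dataclasses import dataclass

LAMBDA_MAX_SECONDS = 15 * 60  # Lambda's hard execution limit


@dataclass
class Workload:
    max_runtime_seconds: float  # longest single task
    needs_os_control: bool      # kernel/OS-level customization required
    steady_24x7: bool           # constant load with little idle time
    latency_sensitive: bool     # cannot tolerate cold-start variance


def recommend(w: Workload) -> str:
    """Return a coarse recommendation; the rule order is an assumption."""
    if w.needs_os_control:
        return "provisioned"
    if w.max_runtime_seconds > LAMBDA_MAX_SECONDS:
        return "provisioned (or split the job with Step Functions)"
    if w.steady_24x7:
        return "provisioned"
    if w.latency_sensitive:
        return "serverless with Provisioned Concurrency"
    return "serverless"
```

For example, a spiky job of two-second tasks with no OS requirements (`Workload(2, False, False, False)`) comes out as `"serverless"`.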

Visual Anchors

Decision Logic: Provisioned vs. Serverless

Cost Efficiency Graph

This graph illustrates the "Break-even Point": the workload volume above which the fixed cost of a provisioned instance becomes lower than the granular, per-use billing of serverless. This is typical of constant, high-volume workloads.
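
The break-even idea can be estimated with a few lines of arithmetic. The sketch below assumes illustrative Lambda rates of $0.0000166667 per GB-second and $0.20 per million requests, ignores the free tier, and uses hypothetical function names. Check current AWS pricing before relying on the numbers.

```python
def lambda_cost_per_invocation(duration_s, memory_gb,
                               gb_second_rate=0.0000166667,
                               request_rate=0.20 / 1_000_000):
    """Cost of one invocation: compute (GB-seconds) plus the request fee."""
    return duration_s * memory_gb * gb_second_rate + request_rate


def break_even_invocations(instance_monthly_cost, duration_s, memory_gb):
    """Monthly invocation count at which serverless cost equals the instance."""
    return instance_monthly_cost / lambda_cost_per_invocation(duration_s, memory_gb)


if __name__ == "__main__":
    # Assumed inputs: a $30/month instance, 2-second runs at 512 MB.
    n = break_even_invocations(30.0, duration_s=2.0, memory_gb=0.5)
    print(f"Break-even at roughly {n:,.0f} invocations per month")
```

Below that invocation count the per-use model is cheaper under these assumptions; above it, the fixed-price instance wins, provided one instance can actually absorb the load.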

Definition-Example Pairs

Event-Driven Execution: Code that runs only when a specific signal occurs.
- Example: An AWS Lambda function that triggers specifically when a .csv file is uploaded to an S3 bucket to convert it to Parquet.
Managed Service: An AWS service where the provider handles the infrastructure but you still manage some configurations (like cluster size).
- Example: Amazon EMR (Elastic MapReduce) where AWS manages the Hadoop/Spark install, but you choose the instance types.
Pay-per-use: A billing model where costs are tied strictly to the amount of data processed or duration of execution.
- Example: Amazon Athena charging $5.00 per Terabyte of data scanned, with $0 cost if no queries are run.
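
To make the event-driven example concrete, here is a minimal sketch of a Lambda handler that reacts to an S3 upload notification. It only reads the standard `Records[].s3.bucket.name` and `Records[].s3.object.key` fields and plans the output key. The actual CSV-to-Parquet conversion is deliberately omitted, and the helper name `csv_objects` and the returned dictionary shape are hypothetical.

```python
from urllib.parse import unquote_plus


def csv_objects(event):
    """Yield (bucket, key) for each .csv object in an S3 event notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".csv"):
            yield bucket, key


def handler(event, context):
    """Plan one Parquet output per uploaded CSV; conversion itself is omitted."""
    planned = []
    for bucket, key in csv_objects(event):
        target_key = key[: -len(".csv")] + ".parquet"
        # A real function would read the CSV and write Parquet here,
        # for example with boto3 plus pyarrow (not shown in this sketch).
        planned.append({"bucket": bucket, "source": key, "target": target_key})
    return planned
```

Nothing runs, and nothing is billed, until an upload event arrives, which is the point of the event-driven model.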

Worked Examples

Scenario: Log Processing Pipeline

Problem: A company receives 1,000 log files daily, but they all arrive within a 10-minute window at midnight.

Option A (Provisioned): An EC2 t3.medium instance costs ~$30/month. It sits idle for 23 hours and 50 minutes every day.
Option B (Serverless): 1,000 Lambda invocations per day (approx. 2 seconds each). Total cost: pennies per month due to the AWS Free Tier and granular billing.
Decision: Serverless is the winner due to the spiky nature of the workload and the high idle cost of EC2.
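
The "pennies per month" estimate for Option B can be checked by hand. The calculation below assumes a 30-day month, a 512 MB (0.5 GB) memory setting, and illustrative Lambda rates of $0.0000166667 per GB-second and $0.20 per million requests. None of these figures come from the scenario itself, so treat the result as an order-of-magnitude check.

```latex
\begin{aligned}
Invocations &= 1{,}000 \times 30 = 30{,}000 \\
GB\text{-}seconds &= 30{,}000 \times 2\ s \times 0.5\ GB = 30{,}000 \\
Compute\ Cost &= 30{,}000 \times \$0.0000166667 \approx \$0.50 \\
Request\ Cost &= 30{,}000 \times \frac{\$0.20}{1{,}000{,}000} = \$0.006 \\
Total &\approx \$0.51\ \text{per month}
\end{aligned}
```

Even before any free-tier allowance is applied, this is roughly one sixtieth of the ~$30/month instance cost in Option A.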

Checkpoint Questions

What is the maximum execution time for an AWS Lambda function?
In which model are you responsible for OS security patching: EC2 or Lambda?
If you have a steady, predictable 24/7 database workload, which is likely more cost-effective: Provisioned RDS or Aurora Serverless?
Why is Lambda considered "stateless"?

Comparison Tables

Feature	Provisioned (e.g., EC2, Redshift)	Serverless (e.g., Lambda, Athena)
Scaling	Manual or Auto-scaling groups	Automatic and transparent
Management	High (OS, patching, software)	Near Zero (No server access)
Payment	Hourly/Monthly (Fixed)	Per request/Data volume (Variable)
Time Limits	Unlimited	15 minutes (Lambda)
Startup Time	Minutes (Booting)	Milliseconds to Seconds (Cold Start)
Customization	High (Kernel/OS level)	Low (Language runtime only)

Muddy Points & Cross-Refs

"Serverless" doesn't mean no servers: There are still physical servers in an AWS data center; you just don't have access to them or the responsibility to maintain them.
Cold Starts: This is the most common "muddy point." If your application requires consistently low latency on every request, Lambda's cold starts might frustrate you unless you use "Provisioned Concurrency," which keeps execution environments initialized and ready.
Lambda Monoliths: Avoid "Monolithic Lambdas" (one large deployment package that handles many unrelated tasks). Break them into small, single-purpose functions for better performance and narrower, easier-to-manage IAM permissions.
State Management: If your function needs to "remember" something from the last run, you must use an external store like DynamoDB or S3, as the local storage (/tmp) is not guaranteed to persist.