Data Transformation Concepts for AWS Data Engineers

This guide covers the core data transformation techniques required for the AWS Certified Data Engineer – Associate (DEA-C01) exam, specifically focusing on how to summarize and restructure data for analytics.

Learning Objectives

After studying this guide, you should be able to:

Define and implement Data Aggregation using SQL and Spark.
Calculate Rolling Averages using window functions.
Differentiate between Grouping and Pivoting operations.
Identify AWS services (Athena, QuickSight, Glue) used for these transformations.

Key Terms & Glossary

Aggregation: The process of gathering and summarizing multiple data points into a single value (e.g., Sum, Mean).
Grouping: Categorizing a dataset based on one or more columns so that operations can be performed on each category independently.
Rolling Average: A calculation used to analyze data points by creating a series of averages of different subsets of the full data set (also known as a moving average).
Pivoting: A data transformation that rotates data from a state of rows to a state of columns, often summarizing the data in the process.
Window Function: A function that performs a calculation across a set of table rows that are somehow related to the current row.

The "Big Idea"

Raw data is often too granular for decision-making. To derive value, data engineers must "compress" or "reshape" this information. Aggregation and Grouping reduce volume to highlight trends, while Pivoting changes the perspective of the data to make it readable for business analysts. These operations form the backbone of ETL (Extract, Transform, Load) processes in AWS Glue and Redshift.

Formula / Concept Box

Concept	SQL / Logic Syntax	Common Use Case
Aggregation	`SELECT SUM(sales) FROM table;`	Total revenue calculation.
Grouping	`SELECT region, COUNT(*) FROM table GROUP BY region;`	Customer count per state.
Rolling Average	`AVG(val) OVER(ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`	Smoothing stock price volatility.
Pivoting	`SELECT * FROM table PIVOT (SUM(sales) FOR month IN ('Jan', 'Feb'));`	Converting monthly rows into column headers.

Hierarchical Outline

I. Data Grouping
- KeyBy Operations: In streaming (Kinesis/Flink), grouping data by a key field to ensure related data is processed together.
- SQL GROUP BY: The foundational method for organizing rows with identical values in specified columns.
II. Data Aggregation
- Standard Functions: COUNT, SUM, AVG, MIN, MAX.
- Stateful vs. Stateless: Aggregations in streaming can be stateful (requiring memory of previous events).
III. Advanced Windowing
- Rolling Averages: Using OVER() clauses to calculate averages over a sliding time or row window.
- Window Types: Tumbling (fixed, non-overlapping) vs. Sliding (overlapping).
IV. Data Reshaping
- Pivoting: Transforming long-format data into wide-format data for reporting.
- Unpivoting: Converting wide data back into a normalized, long format.

Visual Anchors

The Aggregation Pipeline

Loading Diagram...

Visualizing a Rolling Average

Compiling TikZ diagram…

⏳

Running TeX engine…

This may take a few seconds

Definition-Example Pairs

Rolling Average: A calculation to smooth out short-term fluctuations and highlight longer-term trends.
- Example: An AWS CloudWatch alarm that triggers only if the 5-minute rolling average of CPU utilization exceeds 80%, preventing alerts on brief spikes.
Pivoting: Rotating data to create a multidimensional view.
- Example: In Amazon QuickSight, taking a list of daily sales records and pivoting them so that "Product Names" are rows and "Months of the Year" are columns.

Worked Examples

Example 1: SQL Rolling Average

Scenario: Calculate a 7-day moving average of daily revenue to identify growth trends.

sql

SELECT 
    sales_date, 
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY sales_date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM sales_table;

Explanation: OVER defines the window. 6 PRECEDING AND CURRENT ROW ensures the average always includes exactly one week of data ending today.

Example 2: Aggregation in Spark (PySpark)

Scenario: Find the total volume and average price for items sold by category using AWS Glue.

python

from pyspark.sql import functions as F

summary_df = raw_df.groupBy("category") \
    .agg(F.sum("quantity").alias("total_vol"), 
         F.avg("price").alias("avg_price"))

Checkpoint Questions

What is the main difference between a standard aggregation and a rolling average?
In AWS Glue DataBrew, which transformation would you use to change a "Long" dataset (many rows per ID) into a "Wide" dataset (one row per ID with multiple columns)?
Why is KeyBy necessary before performing an aggregation in a streaming DataStream?

[!TIP] Answer Key:

Standard aggregation returns one value for the entire group; a rolling average returns a value for every row based on a sliding window.

Pivoting.

It ensures that all data points sharing the same key are sent to the same processing node for accurate summation/averaging.

Comparison Tables

Feature	Grouping	Pivoting
Result Structure	Vertical (fewer rows)	Horizontal (more columns)
Primary Goal	Segregation of data	Visual comparison/Matrix view
Dimensionality	Usually 1-2 dimensions	Multi-dimensional
AWS Tool	Athena, Redshift, Glue	QuickSight, DataBrew

Muddy Points & Cross-Refs

Rolling Average vs. Cumulative Sum: A rolling average has a fixed window size (e.g., last 30 days). A cumulative sum (running total) grows from the start of the dataset to the end without dropping old data.
Performance Issues: Aggregating over non-partitioned columns in Amazon Athena can lead to high costs and slow performance. Always try to GROUP BY partitioned columns (like year or region) where possible.
Pivoting Limits: Be careful when pivoting data with high cardinality (e.g., pivoting on CustomerID if you have millions of customers). This can create a table with too many columns for most databases to handle.

Data Transformation Concepts for AWS Data Engineers

Learning Objectives

After studying this guide, you should be able to:

Define and implement Data Aggregation using SQL and Spark.
Calculate Rolling Averages using window functions.
Differentiate between Grouping and Pivoting operations.
Identify AWS services (Athena, QuickSight, Glue) used for these transformations.

Key Terms & Glossary

Aggregation: The process of gathering and summarizing multiple data points into a single value (e.g., Sum, Mean).
Grouping: Categorizing a dataset based on one or more columns so that operations can be performed on each category independently.
Rolling Average: A calculation used to analyze data points by creating a series of averages of different subsets of the full data set (also known as a moving average).
Pivoting: A data transformation that rotates data from a state of rows to a state of columns, often summarizing the data in the process.
Window Function: A function that performs a calculation across a set of table rows that are somehow related to the current row.

The "Big Idea"

Formula / Concept Box

Concept	SQL / Logic Syntax	Common Use Case
Aggregation	`SELECT SUM(sales) FROM table;`	Total revenue calculation.
Grouping	`SELECT region, COUNT(*) FROM table GROUP BY region;`	Customer count per state.
Rolling Average	`AVG(val) OVER(ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`	Smoothing stock price volatility.
Pivoting	`SELECT * FROM table PIVOT (SUM(sales) FOR month IN ('Jan', 'Feb'));`	Converting monthly rows into column headers.

Hierarchical Outline

I. Data Grouping
- KeyBy Operations: In streaming (Kinesis/Flink), grouping data by a key field to ensure related data is processed together.
- SQL GROUP BY: The foundational method for organizing rows with identical values in specified columns.
II. Data Aggregation
- Standard Functions: COUNT, SUM, AVG, MIN, MAX.
- Stateful vs. Stateless: Aggregations in streaming can be stateful (requiring memory of previous events).
III. Advanced Windowing
- Rolling Averages: Using OVER() clauses to calculate averages over a sliding time or row window.
- Window Types: Tumbling (fixed, non-overlapping) vs. Sliding (overlapping).
IV. Data Reshaping
- Pivoting: Transforming long-format data into wide-format data for reporting.
- Unpivoting: Converting wide data back into a normalized, long format.

Visual Anchors

The Aggregation Pipeline

Loading Diagram...

Visualizing a Rolling Average

Compiling TikZ diagram…

⏳

Running TeX engine…

This may take a few seconds

Definition-Example Pairs

Rolling Average: A calculation to smooth out short-term fluctuations and highlight longer-term trends.
- Example: An AWS CloudWatch alarm that triggers only if the 5-minute rolling average of CPU utilization exceeds 80%, preventing alerts on brief spikes.
Pivoting: Rotating data to create a multidimensional view.
- Example: In Amazon QuickSight, taking a list of daily sales records and pivoting them so that "Product Names" are rows and "Months of the Year" are columns.

Worked Examples

Example 1: SQL Rolling Average

Scenario: Calculate a 7-day moving average of daily revenue to identify growth trends.

sql

SELECT 
    sales_date, 
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY sales_date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM sales_table;

Explanation: OVER defines the window. 6 PRECEDING AND CURRENT ROW ensures the average always includes exactly one week of data ending today.

Example 2: Aggregation in Spark (PySpark)

Scenario: Find the total volume and average price for items sold by category using AWS Glue.

python

from pyspark.sql import functions as F

summary_df = raw_df.groupBy("category") \
    .agg(F.sum("quantity").alias("total_vol"), 
         F.avg("price").alias("avg_price"))

Checkpoint Questions

What is the main difference between a standard aggregation and a rolling average?
In AWS Glue DataBrew, which transformation would you use to change a "Long" dataset (many rows per ID) into a "Wide" dataset (one row per ID with multiple columns)?
Why is KeyBy necessary before performing an aggregation in a streaming DataStream?

[!TIP] Answer Key:

Standard aggregation returns one value for the entire group; a rolling average returns a value for every row based on a sliding window.

Pivoting.

It ensures that all data points sharing the same key are sent to the same processing node for accurate summation/averaging.

Comparison Tables

Feature	Grouping	Pivoting
Result Structure	Vertical (fewer rows)	Horizontal (more columns)
Primary Goal	Segregation of data	Visual comparison/Matrix view
Dimensionality	Usually 1-2 dimensions	Multi-dimensional
AWS Tool	Athena, Redshift, Glue	QuickSight, DataBrew

Muddy Points & Cross-Refs

Rolling Average vs. Cumulative Sum: A rolling average has a fixed window size (e.g., last 30 days). A cumulative sum (running total) grows from the start of the dataset to the end without dropping old data.
Performance Issues: Aggregating over non-partitioned columns in Amazon Athena can lead to high costs and slow performance. Always try to GROUP BY partitioned columns (like year or region) where possible.
Pivoting Limits: Be careful when pivoting data with high cardinality (e.g., pivoting on CustomerID if you have millions of customers). This can create a table with too many columns for most databases to handle.

AWS Data Engineering: Data Aggregation, Rolling Averages, Grouping, and Pivoting

Data Transformation Concepts for AWS Data Engineers

Learning Objectives

Key Terms & Glossary

The "Big Idea"

Formula / Concept Box

Hierarchical Outline

Visual Anchors

The Aggregation Pipeline

Visualizing a Rolling Average

Definition-Example Pairs

Worked Examples

Example 1: SQL Rolling Average

Example 2: Aggregation in Spark (PySpark)

Checkpoint Questions

Comparison Tables

Muddy Points & Cross-Refs

AWS Data Engineering: Data Aggregation, Rolling Averages, Grouping, and Pivoting

Data Transformation Concepts for AWS Data Engineers

Learning Objectives

Key Terms & Glossary

The "Big Idea"

Formula / Concept Box

Hierarchical Outline

Visual Anchors

The Aggregation Pipeline

Visualizing a Rolling Average

Definition-Example Pairs

Worked Examples

Example 1: SQL Rolling Average

Example 2: Aggregation in Spark (PySpark)

Checkpoint Questions

Comparison Tables

Muddy Points & Cross-Refs