Lab: Transform Data and Perform Feature Engineering with AWS SageMaker
This lab provides a hands-on experience in transforming raw datasets into machine learning-ready features. You will use Amazon SageMaker and AWS CLI to perform common feature engineering tasks such as scaling, encoding, and storing features in a centralized Feature Store.
Prerequisites
Before starting this lab, ensure you have the following:
- An AWS Account with administrative access or the `AmazonSageMakerFullAccess` managed policy.
- AWS CLI configured on your local machine with credentials (`aws configure`).
- Basic familiarity with Python and Pandas (though most steps are UI/CLI based).
- A local copy of a sample CSV dataset (e.g., a housing or credit risk dataset).
Learning Objectives
By the end of this lab, you will be able to:
- Clean and Ingest Data: Handle missing values and upload raw data to Amazon S3.
- Perform Numerical Scaling: Apply Min-Max Normalization and Z-Score Standardization.
- Apply Categorical Encoding: Implement One-Hot Encoding for nominal data and Label Encoding for ordinal data.
- Manage Features: Store engineered features in the Amazon SageMaker Feature Store for consistency.
Architecture Overview
The following diagram illustrates the flow of data from ingestion to the feature store.
Step-by-Step Instructions
Step 1: Prepare the S3 Bucket and Data
First, we need a landing zone for our raw data.
```bash
# Create a unique bucket name
BUCKET_NAME=brainybee-lab-$(date +%s)
aws s3 mb s3://$BUCKET_NAME

# Upload your raw CSV dataset
aws s3 cp raw_data.csv s3://$BUCKET_NAME/input/raw_data.csv
```

Console alternative:
1. Navigate to S3 in the AWS Console.
2. Click Create bucket, enter a unique name, and confirm with Create bucket.
3. Click into your bucket, select Upload, and add your local CSV file.
Step 2: Initialize SageMaker Data Wrangler
We will use Data Wrangler (now part of SageMaker Canvas) to visually design our transformations.
- Open the SageMaker Console.
- In the left sidebar, click Canvas (or Data Wrangler if using an older UI version).
- Click New Analysis and name it `Feature-Engineering-Lab`.
- Select Import Data and choose the S3 bucket/file created in Step 1.
Step 3: Numerical Feature Scaling
We need to ensure all numerical features contribute equally. We will apply Normalization (0 to 1 range).
[!NOTE] Normalization (Min-Max) is best when the distribution is unknown or a bounded range matters. Standardization (Z-Score) is preferred for roughly normally distributed data and is less distorted by outliers than Min-Max scaling.
Visualizing the Difference:
- In the Data Wrangler UI, click + Add Step > Transform.
- Search for Process Numeric > Scale Values.
- Choose Min-Max Scaler.
- Select your numeric columns (e.g., `Price`, `Income`).
- Click Preview and then Add.
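The Min-Max transform applied here, and the Z-Score alternative mentioned in the note above, can be sketched in plain Python (the values are illustrative, not from the lab dataset):

```python
prices = [100_000, 250_000, 400_000, 550_000]  # illustrative column values

# Min-Max Normalization: rescale into the [0, 1] range
lo, hi = min(prices), max(prices)
minmax = [(p - lo) / (hi - lo) for p in prices]

# Z-Score Standardization: center on mean=0 with std=1
mean = sum(prices) / len(prices)
std = (sum((p - mean) ** 2 for p in prices) / len(prices)) ** 0.5
zscore = [(p - mean) / std for p in prices]

print(minmax)   # smallest value maps to 0.0, largest to 1.0
print(zscore)   # centered values, summing to ~0
```

Data Wrangler applies the same arithmetic column-wide; the preview in the UI should show the `Price` column collapsing into the 0.0-1.0 range.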
Step 4: Categorical Encoding
Machine learning models require numerical input. We will convert categories into numbers.
- Click + Add Step > Transform.
- Search for Encode Categorical.
- Choose One-Hot Encode for nominal data (e.g., `Color`, `City`).
- Choose Ordinal Encode (label encoding) for ordinal data (e.g., `Ranking`).
- Click Add.
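The two encodings above can be sketched in Pandas (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["NewYork", "Austin", "NewYork"],   # nominal: no inherent order
    "Ranking": ["Low", "High", "Medium"],       # ordinal: order matters
})

# One-Hot Encoding: one binary flag column per category
onehot = pd.get_dummies(df["City"], prefix="City")

# Label (ordinal) encoding: integers that preserve the category order
order = {"Low": 0, "Medium": 1, "High": 2}
df["Ranking_encoded"] = df["Ranking"].map(order)
```

One-Hot avoids implying an order between cities, while label encoding deliberately preserves the Low < Medium < High ordering.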
Step 5: Export to SageMaker Feature Store
Once transformations are complete, we persist the data for reuse.
- Click on Export at the top of the Data Wrangler flow.
- Choose Amazon SageMaker Feature Store.
- Define a Feature Group Name (e.g., `transformed-housing-features`).
- Configure the Record Identifier (usually a unique ID column) and Event Time column.
- Click Run.
Checkpoints
| Checkpoint | Action | Expected Result |
|---|---|---|
| Data Import | Check Data Wrangler preview. | Table displays first 100 rows correctly. |
| Scaling | Check the 'Price' column values. | All values are between 0.0 and 1.0. |
| Encoding | Look for new columns like City_NewYork. | Original 'City' column is replaced by binary flags. |
| Persistence | Run aws sagemaker list-feature-groups in CLI. | The group name created in Step 5 appears in the list. |
Troubleshooting
| Error | Likely Cause | Solution |
|---|---|---|
| AccessDenied on S3 | Missing IAM permissions | Ensure your SageMaker Execution Role has s3:GetObject and s3:ListBucket. |
| ResourceLimitExceeded | Too many active Canvas apps | Shut down existing SageMaker Canvas apps before starting a new one. |
| Scaling failed | Non-numeric data in column | Use the "Handle missing values" transform to remove nulls or strings first. |
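The pre-cleaning mentioned in the last row can be sketched in Pandas before re-running the scaler (column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Price": ["100000", None, "oops", "250000"]})

# Coerce to numeric: non-numeric strings become NaN instead of failing the scaler
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# Drop rows that still lack a usable value before scaling
clean = df.dropna(subset=["Price"])
```

In Data Wrangler the equivalent is the "Handle missing values" transform; the sketch just shows what that step does to the column.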
Clean-Up / Teardown
[!WARNING] Failure to delete these resources will result in ongoing hourly charges for SageMaker Canvas and storage costs for S3/Feature Store.
- Stop SageMaker Canvas: In the SageMaker Console, go to Canvas and click Log out or Delete app.
- Delete Feature Group:
```bash
aws sagemaker delete-feature-group --feature-group-name transformed-housing-features
```
- Empty and Delete S3 Bucket:
```bash
aws s3 rb s3://$BUCKET_NAME --force
```
Stretch Challenge
Task: Implement Feature Hashing for a high-cardinality categorical column (a column with thousands of unique values). Compare the number of output columns generated by Feature Hashing versus One-Hot Encoding.
Hint: Use the "Search for Transforms" box in Data Wrangler and type "Hashing". Note how it allows you to specify a fixed number of output features to save memory.
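The idea behind the challenge can be sketched in plain Python: hash each category into a fixed number of buckets, so the output width stays constant no matter how many unique values exist (the bucket count and city names are chosen for illustration):

```python
import hashlib

N_BUCKETS = 8  # fixed output width, regardless of cardinality

def hash_bucket(value: str, n_buckets: int = N_BUCKETS) -> int:
    # Stable hash so the bucket assignment is reproducible across runs
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

cities = ["NewYork", "Austin", "Tokyo", "Berlin", "Lagos"]
buckets = [hash_bucket(c) for c in cities]

# One-Hot would need one column per unique city (thousands on real data);
# hashing always needs exactly N_BUCKETS columns.
```

The trade-off: unrelated categories can collide in the same bucket, which is the price paid for the fixed memory footprint.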
Cost Estimate
- SageMaker Canvas: ~$1.90 per hour (instance cost) + $0.15 per million cells processed.
- S3 Storage: $0.023 per GB (negligible for this lab).
- Feature Store: Free tier includes 10M write operations/month; otherwise, $0.10 per GB-month.
- Estimated Total: < $5.00 if completed and torn down within 1 hour.
Concept Review
| Technique | Data Type | Best Use Case |
|---|---|---|
| Min-Max Scaling | Numeric | Normalizing features to a range [0,1] for algorithms like KNN. |
| Z-Score (Standardization) | Numeric | Handling outliers by centering data around mean=0, std=1. |
| One-Hot Encoding | Categorical | Nominal data with low cardinality (e.g., Red, Blue, Green). |
| Label Encoding | Categorical | Ordinal data where order matters (e.g., Small, Medium, Large). |
| Binary Encoding | Categorical | High cardinality data to reduce dimensionality vs One-Hot. |
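As a sketch of the Binary Encoding row (categories and indices illustrative): each category first gets an integer index, which is then written out as fixed-width binary digits, so n categories need only about log2(n) columns instead of the n columns One-Hot would produce.

```python
cats = ["Red", "Blue", "Green", "Yellow", "Purple"]
index = {c: i for i, c in enumerate(cats)}   # ordinal step: category -> integer
width = max(index.values()).bit_length()     # 3 bits are enough for 5 categories

def binary_encode(cat: str) -> list[int]:
    i = index[cat]
    # Emit the index as fixed-width binary digits, most significant bit first
    return [(i >> b) & 1 for b in reversed(range(width))]

print(binary_encode("Yellow"))  # index 3 -> [0, 1, 1]
```

With thousands of categories the width grows only logarithmically, which is why the table recommends it for high-cardinality data.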