Lab: Transform Data and Perform Feature Engineering with AWS SageMaker
This lab provides a hands-on experience in transforming raw datasets into machine learning-ready features. You will use Amazon SageMaker and AWS CLI to perform common feature engineering tasks such as scaling, encoding, and storing features in a centralized Feature Store.
Prerequisites
Before starting this lab, ensure you have the following:
- An AWS Account with administrative access or the `AmazonSageMakerFullAccess` managed policy.
- AWS CLI configured on your local machine with credentials (`aws configure`).
- Basic familiarity with Python and Pandas (though most steps are UI/CLI based).
- A local copy of a sample CSV dataset (e.g., a housing or credit risk dataset).
Learning Objectives
By the end of this lab, you will be able to:
- Clean and Ingest Data: Handle missing values and upload raw data to Amazon S3.
- Perform Numerical Scaling: Apply Min-Max Normalization and Z-Score Standardization.
- Apply Categorical Encoding: Implement One-Hot Encoding for nominal data and Label Encoding for ordinal data.
- Manage Features: Store engineered features in the Amazon SageMaker Feature Store for consistency.
Architecture Overview
The following diagram illustrates the flow of data from ingestion to the feature store.
Step-by-Step Instructions
Step 1: Prepare the S3 Bucket and Data
First, we need a landing zone for our raw data.
```bash
# Create a unique bucket name
BUCKET_NAME=brainybee-lab-$(date +%s)
aws s3 mb s3://$BUCKET_NAME

# Upload your raw CSV dataset
aws s3 cp raw_data.csv s3://$BUCKET_NAME/input/raw_data.csv
```

Console alternative:
1. Navigate to S3 in the AWS Console.
2. Click Create bucket, enter a unique name, and confirm with Create bucket.
3. Click into your bucket, select Upload, and add your local CSV file.
Step 2: Initialize SageMaker Data Wrangler
We will use Data Wrangler (now part of SageMaker Canvas) to visually design our transformations.
- Open the SageMaker Console.
- In the left sidebar, click Canvas (or Data Wrangler if using an older UI version).
- Click New Analysis and name it `Feature-Engineering-Lab`.
- Select Import Data and choose the S3 bucket/file created in Step 1.
Step 3: Numerical Feature Scaling
We need to ensure all numerical features contribute equally. We will apply Normalization (0 to 1 range).
[!NOTE] Normalization (Min-Max) is best when the distribution is unknown or a bounded range matters. Standardization (Z-Score) is preferred for roughly normally distributed data and is less distorted by outliers than Min-Max scaling.
Visualizing the Difference:
- In the Data Wrangler UI, click + Add Step > Transform.
- Search for Process Numeric > Scale Values.
- Choose Min-Max Scaler.
- Select your numeric columns (e.g., `Price`, `Income`).
- Click Preview and then Add.
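The Min-Max transform applied here, and the Z-Score alternative mentioned in the note above, can be sketched in plain Python (the values are illustrative, not from the lab dataset):

```python
prices = [100_000, 250_000, 400_000, 550_000]  # illustrative column values

# Min-Max Normalization: rescale into the [0, 1] range
lo, hi = min(prices), max(prices)
minmax = [(p - lo) / (hi - lo) for p in prices]

# Z-Score Standardization: center on mean=0 with std=1
mean = sum(prices) / len(prices)
std = (sum((p - mean) ** 2 for p in prices) / len(prices)) ** 0.5
zscore = [(p - mean) / std for p in prices]

print(minmax)   # smallest value maps to 0.0, largest to 1.0
print(zscore)   # centered values, summing to ~0
```

Data Wrangler applies the same arithmetic column-wide; the preview in the UI should show the `Price` column collapsing into the 0.0-1.0 range.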
Step 4: Categorical Encoding
Machine learning models require numerical input. We will convert categories into numbers.
- Click + Add Step > Transform.
- Search for Encode Categorical.
- Choose One-Hot Encode for nominal data (e.g., `Color`, `City`).
- Choose Ordinal Encode (label encoding) for ordinal data (e.g., `Ranking`).
- Click Add.
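The two encodings above can be sketched in Pandas (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["NewYork", "Austin", "NewYork"],   # nominal: no inherent order
    "Ranking": ["Low", "High", "Medium"],       # ordinal: order matters
})

# One-Hot Encoding: one binary flag column per category
onehot = pd.get_dummies(df["City"], prefix="City")

# Label (ordinal) encoding: integers that preserve the category order
order = {"Low": 0, "Medium": 1, "High": 2}
df["Ranking_encoded"] = df["Ranking"].map(order)
```

One-Hot avoids implying an order between cities, while label encoding deliberately preserves the Low < Medium < High ordering.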
Step 5: Export to SageMaker Feature Store
Once transformations are complete, we persist the data for reuse.
- Click on Export at the top of the Data Wrangler flow.
- Choose Amazon SageMaker Feature Store.
- Define a Feature Group Name (e.g., `transformed-housing-features`).
- Configure the Record Identifier (usually a unique ID column) and Event Time column.
- Click Run.
Checkpoints
| Checkpoint | Action | Expected Result |
|---|---|---|
| Data Import | Check Data Wrangler preview. | Table displays first 100 rows correctly. |
| Scaling | Check the 'Price' column values. | All values are between 0.0 and 1.0. |
| Encoding | Look for new columns like City_NewYork. | Original 'City' column is replaced by binary flags. |
| Persistence | Run aws sagemaker list-feature-groups in CLI. | The group name created in Step 5 appears in the list. |
Troubleshooting
| Error | Likely Cause | Solution |
|---|---|---|
| AccessDenied on S3 | Missing IAM permissions | Ensure your SageMaker Execution Role has s3:GetObject and s3:ListBucket. |
| ResourceLimitExceeded | Too many active Canvas apps | Shut down existing SageMaker Canvas apps before starting a new one. |
| Scaling failed | Non-numeric data in column | Use the "Handle missing values" transform to remove nulls or strings first. |
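The pre-cleaning mentioned in the last row can be sketched in Pandas before re-running the scaler (column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Price": ["100000", None, "oops", "250000"]})

# Coerce to numeric: non-numeric strings become NaN instead of failing the scaler
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# Drop rows that still lack a usable value before scaling
clean = df.dropna(subset=["Price"])
```

In Data Wrangler the equivalent is the "Handle missing values" transform; the sketch just shows what that step does to the column.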
Clean-Up / Teardown
[!WARNING] Failure to delete these resources will result in ongoing hourly charges for SageMaker Canvas and storage costs for S3/Feature Store.
- Stop SageMaker Canvas: In the SageMaker Console, go to Canvas and click Log out or Delete app.
- Delete Feature Group:
```bash
aws sagemaker delete-feature-group --feature-group-name transformed-housing-features
```
- Empty and Delete S3 Bucket:
```bash
aws s3 rb s3://$BUCKET_NAME --force
```
Stretch Challenge
Task: Implement Feature Hashing for a high-cardinality categorical column (a column with thousands of unique values). Compare the number of output columns generated by Feature Hashing versus One-Hot Encoding.
Hint: Use the "Search for Transforms" box in Data Wrangler and type "Hashing". Note how it allows you to specify a fixed number of output features to save memory.
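The idea behind the challenge can be sketched in plain Python: hash each category into a fixed number of buckets, so the output width stays constant no matter how many unique values exist (the bucket count and city names are chosen for illustration):

```python
import hashlib

N_BUCKETS = 8  # fixed output width, regardless of cardinality

def hash_bucket(value: str, n_buckets: int = N_BUCKETS) -> int:
    # Stable hash so the bucket assignment is reproducible across runs
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

cities = ["NewYork", "Austin", "Tokyo", "Berlin", "Lagos"]
buckets = [hash_bucket(c) for c in cities]

# One-Hot would need one column per unique city (thousands on real data);
# hashing always needs exactly N_BUCKETS columns.
```

The trade-off: unrelated categories can collide in the same bucket, which is the price paid for the fixed memory footprint.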
Cost Estimate
- SageMaker Canvas: ~$1.90 per hour (instance cost) + $0.15 per million cells processed.
- S3 Storage: $0.023 per GB (negligible for this lab).
- Feature Store: Free tier includes 10M write operations/month; otherwise, $0.10 per GB-month.
- Estimated Total: < $5.00 if completed and torn down within 1 hour.
Concept Review
| Technique | Data Type | Best Use Case |
|---|---|---|
| Min-Max Scaling | Numeric | Normalizing features to a range [0,1] for algorithms like KNN. |
| Z-Score (Standardization) | Numeric | Handling outliers by centering data around mean=0, std=1. |
| One-Hot Encoding | Categorical | Nominal data with low cardinality (e.g., Red, Blue, Green). |
| Label Encoding | Categorical | Ordinal data where order matters (e.g., Small, Medium, Large). |
| Binary Encoding | Categorical | High cardinality data to reduce dimensionality vs One-Hot. |
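As a sketch of the Binary Encoding row (categories and indices illustrative): each category first gets an integer index, which is then written out as fixed-width binary digits, so n categories need only about log2(n) columns instead of the n columns One-Hot would produce.

```python
cats = ["Red", "Blue", "Green", "Yellow", "Purple"]
index = {c: i for i, c in enumerate(cats)}   # ordinal step: category -> integer
width = max(index.values()).bit_length()     # 3 bits are enough for 5 categories

def binary_encode(cat: str) -> list[int]:
    i = index[cat]
    # Emit the index as fixed-width binary digits, most significant bit first
    return [(i >> b) & 1 for b in reversed(range(width))]

print(binary_encode("Yellow"))  # index 3 -> [0, 1, 1]
```

With thousands of categories the width grows only logarithmically, which is why the table recommends it for high-cardinality data.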