
Lab: Transform Data and Perform Feature Engineering with AWS SageMaker


This lab provides a hands-on experience in transforming raw datasets into machine learning-ready features. You will use Amazon SageMaker and AWS CLI to perform common feature engineering tasks such as scaling, encoding, and storing features in a centralized Feature Store.

Prerequisites

Before starting this lab, ensure you have the following:

  • An AWS Account with administrative access or the AmazonSageMakerFullAccess managed policy.
  • AWS CLI configured on your local machine with credentials (aws configure).
  • Basic familiarity with Python and Pandas (though most steps are UI/CLI based).
  • A local copy of a sample CSV dataset (e.g., a housing or credit risk dataset).

Learning Objectives

By the end of this lab, you will be able to:

  1. Clean and Ingest Data: Handle missing values and upload raw data to Amazon S3.
  2. Perform Numerical Scaling: Apply Min-Max Normalization and Z-Score Standardization.
  3. Apply Categorical Encoding: Implement One-Hot Encoding for nominal data and Label Encoding for ordinal data.
  4. Manage Features: Store engineered features in the Amazon SageMaker Feature Store for consistency.

Architecture Overview

The data flows from ingestion to the feature store: a raw CSV lands in Amazon S3, Data Wrangler applies the scaling and encoding transformations, and the resulting features are persisted in the SageMaker Feature Store.

Step-by-Step Instructions

Step 1: Prepare the S3 Bucket and Data

First, we need a landing zone for our raw data.

```bash
# Create a unique bucket name
BUCKET_NAME=brainybee-lab-$(date +%s)
aws s3 mb s3://$BUCKET_NAME

# Upload your raw CSV dataset
aws s3 cp raw_data.csv s3://$BUCKET_NAME/input/raw_data.csv
```
Console alternative:

  1. Navigate to S3 in the AWS Console.
  2. Click Create bucket, enter a unique name, and click Create.
  3. Click into your bucket, select Upload, and add your local CSV file.
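Before uploading, you may also want to handle missing values locally, as called out in the learning objectives. A minimal sketch using only the standard library; the `Income` column name and median-fill strategy are illustrative, not prescribed by the lab:

```python
import statistics

def fill_missing_income(rows, column="Income"):
    """Replace empty values in a numeric column with the column median."""
    values = [float(r[column]) for r in rows if r[column] not in ("", None)]
    median = statistics.median(values)
    for r in rows:
        if r[column] in ("", None):
            r[column] = median
        else:
            r[column] = float(r[column])
    return rows

rows = [{"Income": "50000"}, {"Income": ""}, {"Income": "70000"}]
cleaned = fill_missing_income(rows)
# The empty value is replaced by the median of 50000 and 70000, i.e. 60000.0
```

Data Wrangler offers the same operation as the "Handle missing values" transform, used again in the Troubleshooting section below.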

Step 2: Initialize SageMaker Data Wrangler

We will use Data Wrangler (now part of SageMaker Canvas) to visually design our transformations.

  1. Open the SageMaker Console.
  2. In the left sidebar, click Canvas (or Data Wrangler if using an older UI version).
  3. Click New Analysis and name it Feature-Engineering-Lab.
  4. Select Import Data and choose the S3 bucket/file created in Step 1.

Step 3: Numerical Feature Scaling

We need to ensure all numerical features contribute equally. We will apply Normalization (0 to 1 range).

[!NOTE] Normalization (min-max) is a good default when the distribution is unknown and a bounded output range matters. Standardization is preferred for approximately normally distributed data and is less sensitive to outliers than min-max scaling.

[Figure: Visualizing the difference between Min-Max normalization and Z-score standardization]
  1. In the Data Wrangler UI, click + Add Step > Transform.
  2. Search for Process Numeric > Scale Values.
  3. Choose Min-Max Scaler.
  4. Select your numeric columns (e.g., Price, Income).
  5. Click Preview and then Add.
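The math behind the two scalers in the note can be sketched in plain Python; the `prices` values here are illustrative, and Data Wrangler applies the equivalent computation internally:

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Center values at mean 0 with standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

prices = [100.0, 200.0, 300.0]
print(min_max_scale(prices))        # [0.0, 0.5, 1.0]
print(z_score_standardize(prices))  # centered: values sum to 0
```

This is also what the Scaling checkpoint below verifies: after Min-Max scaling, every value in the column falls between 0.0 and 1.0.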

Step 4: Categorical Encoding

Machine learning models require numerical input. We will convert categories into numbers.

  1. Click + Add Step > Transform.
  2. Search for Encode Categorical.
  3. Choose One-Hot Encode for nominal data (e.g., Color, City).
  4. Choose Ordinal Encode (label encoding) for ordinal data (e.g., Ranking).
  5. Click Add.
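The two encodings can be sketched in plain Python to show what Data Wrangler produces; the `City_` prefix and the category values are illustrative:

```python
def one_hot_encode(values):
    """One-hot encode a nominal column: one binary flag per category."""
    categories = sorted(set(values))
    return [{f"City_{c}": int(v == c) for c in categories} for v in values]

def ordinal_encode(values, order):
    """Map an ordinal column to integers that preserve its order."""
    mapping = {label: i for i, label in enumerate(order)}
    return [mapping[v] for v in values]

cities = ["NewYork", "Boston", "NewYork"]
sizes = ["Small", "Large", "Medium"]

print(one_hot_encode(cities))
# Each row becomes binary flags: {'City_Boston': 0/1, 'City_NewYork': 0/1}
print(ordinal_encode(sizes, order=["Small", "Medium", "Large"]))  # [0, 2, 1]
```

Note how one-hot encoding replaces the original column with one binary column per category, which is exactly what the Encoding checkpoint below looks for.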

Step 5: Export to SageMaker Feature Store

Once transformations are complete, we persist the data for reuse.

  1. Click on Export at the top of the Data Wrangler flow.
  2. Choose Amazon SageMaker Feature Store.
  3. Define a Feature Group Name (e.g., transformed-housing-features).
  4. Configure the Record Identifier (usually a unique ID column) and Event Time column.
  5. Click Run.
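Outside the console, records can also be written through the Feature Store runtime API. The sketch below builds the `Record` payload shape that `put_record` expects (every value is sent as a string); the feature names are hypothetical, and the actual boto3 call is commented out because it requires live AWS credentials:

```python
import time

def build_record(features: dict) -> list:
    """Convert a feature dict into the Record payload used by the
    sagemaker-featurestore-runtime put_record API."""
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()]

record = build_record({
    "record_id": 42,                   # the Record Identifier column
    "event_time": round(time.time()),  # the Event Time column
    "price_scaled": 0.73,
})

# With credentials configured, the record could then be ingested via:
# import boto3
# runtime = boto3.client("sagemaker-featurestore-runtime")
# runtime.put_record(FeatureGroupName="transformed-housing-features", Record=record)
```

The Record Identifier and Event Time entries correspond to the columns you configure in step 4 above.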

Checkpoints

| Checkpoint | Action | Expected Result |
| --- | --- | --- |
| Data Import | Check the Data Wrangler preview. | Table displays the first 100 rows correctly. |
| Scaling | Check the `Price` column values. | All values are between 0.0 and 1.0. |
| Encoding | Look for new columns like `City_NewYork`. | The original `City` column is replaced by binary flags. |
| Persistence | Run `aws sagemaker list-feature-groups` in the CLI. | The group name created in Step 5 appears in the list. |

Troubleshooting

| Error | Likely Cause | Solution |
| --- | --- | --- |
| `AccessDenied` on S3 | Missing IAM permissions | Ensure your SageMaker Execution Role has `s3:GetObject` and `s3:ListBucket`. |
| `ResourceLimitExceeded` | Too many active Canvas apps | Shut down existing SageMaker Canvas apps before starting a new one. |
| Scaling failed | Non-numeric data in the column | Use the "Handle missing values" transform to remove nulls or strings first. |

Clean-Up / Teardown

[!WARNING] Failure to delete these resources will result in ongoing hourly charges for SageMaker Canvas and storage costs for S3/Feature Store.

  1. Stop SageMaker Canvas: In the SageMaker Console, go to Canvas and click Log out or Delete app.
  2. Delete Feature Group:
     ```bash
     aws sagemaker delete-feature-group --feature-group-name transformed-housing-features
     ```
  3. Empty and Delete S3 Bucket:
     ```bash
     aws s3 rb s3://$BUCKET_NAME --force
     ```

Stretch Challenge

Task: Implement Feature Hashing for a high-cardinality categorical column (a column with thousands of unique values). Compare the number of output columns generated by Feature Hashing versus One-Hot Encoding.

Hint

Use the "Search for Transforms" box in Data Wrangler and type "Hashing". Note how it allows you to specify a fixed number of output features to save memory.
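The hashing trick behind that transform can be sketched without any libraries: each category is hashed into one of a fixed number of buckets, so the output width stays constant no matter how many distinct values the column holds. The bucket count and category names below are illustrative:

```python
import hashlib

def hash_encode(values, n_features=8):
    """Feature hashing: map each category to one of n_features buckets."""
    rows = []
    for v in values:
        row = [0] * n_features
        # A stable hash (md5) keeps the mapping deterministic across runs
        bucket = int(hashlib.md5(v.encode()).hexdigest(), 16) % n_features
        row[bucket] = 1
        rows.append(row)
    return rows

# A high-cardinality column with 1,000 distinct values...
cities = [f"City_{i}" for i in range(1000)]

hashed = hash_encode(cities, n_features=8)
print(len(hashed[0]))    # 8 output columns, regardless of cardinality
print(len(set(cities)))  # one-hot encoding would need 1000 columns
```

The trade-off is that distinct categories can collide in the same bucket, which is why hashing is reserved for columns where one-hot encoding would be too wide.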

Cost Estimate

  • SageMaker Canvas: ~$1.90 per hour (instance cost) + $0.15 per million cells processed.
  • S3 Storage: $0.023 per GB (negligible for this lab).
  • Feature Store: Free tier includes 10M write operations/month; otherwise, $0.10 per GB-month.
  • Estimated Total: < $5.00 if completed and torn down within 1 hour.

Concept Review

| Technique | Data Type | Best Use Case |
| --- | --- | --- |
| Min-Max Scaling | Numeric | Normalizing features to a [0, 1] range for algorithms like KNN. |
| Z-Score (Standardization) | Numeric | Handling outliers by centering data around mean 0, std 1. |
| One-Hot Encoding | Categorical | Nominal data with low cardinality (e.g., Red, Blue, Green). |
| Label Encoding | Categorical | Ordinal data where order matters (e.g., Small, Medium, Large). |
| Binary Encoding | Categorical | High-cardinality data, to reduce dimensionality vs. One-Hot. |
