Mastering Data Transformation and Feature Engineering for AWS ML
Transform data and perform feature engineering
This guide covers the critical phase of the machine learning lifecycle where raw data is converted into high-quality features. As emphasized in the AWS Certified Machine Learning Engineer Associate exam, this process is both an art and a science, directly impacting model performance.
Learning Objectives
- Execute Data Cleaning: Identify and treat outliers, missing values, and duplicate records.
- Apply Scaling Techniques: Differentiate between and implement Normalization and Standardization.
- Encode Categorical Data: Select the correct encoding strategy (One-Hot, Label, Binary, or Hashing) based on cardinality and model type.
- Handle Unstructured Data: Utilize AWS services for feature extraction from images and text.
- Leverage AWS Tooling: Use SageMaker Data Wrangler, Feature Store, and AWS Glue for scalable transformations.
Key Terms & Glossary
- Feature Engineering: The process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.
- Imputation: The process of replacing missing data with substituted values (e.g., mean, median, or most frequent value).
- Cardinality: The number of unique values in a categorical feature. "High cardinality" means many unique values (e.g., ZIP codes).
- Stemming/Lemmatization: Text processing techniques to reduce words to their root form (e.g., "running" to "run").
- Data Leakage: An error where information from outside the training dataset is used to create the model, leading to overly optimistic performance.
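To make the glossary concrete, here is a minimal imputation sketch using pandas (the `age` column is illustrative, not from the source):

```python
# Mean imputation, a minimal sketch with pandas.
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 40.0, 35.0]})

# Replace the missing value with the column mean: (25 + 40 + 35) / 3
df["age"] = df["age"].fillna(df["age"].mean())
```

Median or mode imputation works the same way via `.median()` or `.mode()[0]`, and is often preferred when the column is skewed.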
The "Big Idea"
[!IMPORTANT] Raw data is rarely ready for a model. If the "ingestion" phase is about getting data into the cloud, the "transformation" phase is about translating that data into a language the algorithm understands. High-quality features are more influential on model success than the algorithm choice itself.
Formula / Concept Box
| Technique | Formula | Best Use Case |
|---|---|---|
| Min-Max Scaling (Normalization) | $X' = \frac{X - X_{min}}{X_{max} - X_{min}}$ | When the distribution is unknown or not Gaussian; scale matters (0 to 1). |
| Z-Score (Standardization) | $Z = \frac{X - \mu}{\sigma}$ | When data follows a Gaussian distribution; handles outliers better than normalization. |
| Log Transformation | $X' = \log(X + 1)$ | To reduce the effect of right-skewed data and stabilize variance. |
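The three techniques in the table can be sketched in a few lines of NumPy (the input array is synthetic):

```python
# Sketch of the three scaling/transform techniques from the table above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-Max scaling: (x - min) / (max - min) squeezes values into [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (x - mean) / std gives mean 0 and unit variance
zscore = (x - x.mean()) / x.std()

# Log transform: log1p(x) = log(x + 1) handles zeros and compresses right tails
logged = np.log1p(x)
```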
Hierarchical Outline
- Data Preprocessing (Cleaning)
- Handling Missing Values: Deletion vs. Imputation (Mean/Median/Mode).
- Outlier Detection: Identifying extreme values that skew results.
- Deduplication: Removing redundant records to prevent overfitting.
- Numeric Feature Engineering
- Scaling: Adjusting the range of features so one doesn't dominate others.
- Binning: Converting continuous numbers into discrete categories (e.g., Age to Age-Groups).
- Categorical Feature Engineering
- Label Encoding: Assigning integers to ordinal data (where order matters).
- One-Hot Encoding (OHE): Creating binary columns for nominal data (no order).
- Feature Hashing: Handling high-cardinality data efficiently.
- Unstructured Data Features
- Image: Extraction via Amazon Rekognition or SageMaker JumpStart.
- Text: Processing via Amazon Comprehend (NLP) or manual Tokenization.
- AWS Managed Services
- SageMaker Data Wrangler: Visual UI for 300+ built-in transformations.
- SageMaker Feature Store: Centralized repository for feature reuse and sync.
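The two core categorical encodings from the outline can be sketched with pandas (column names and the size ordering are illustrative):

```python
# One-Hot Encoding for nominal data vs. Label Encoding for ordinal data.
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Red"],
                   "size": ["S", "M", "L", "M"]})

# One-Hot Encoding: one binary column per category, no implied rank
ohe = pd.get_dummies(df["color"], prefix="is").astype(int)

# Label Encoding: map ordered categories to integers that preserve the rank
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)
```

Note that `get_dummies` orders the new columns alphabetically, which is why one-hot output should be generated consistently between training and inference.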
Visual Anchors
- The Transformation Pipeline *(diagram not rendered)*
- Scaling Visualized *(diagram not rendered)*
Definition-Example Pairs
- Binning: Converting a continuous variable into buckets.
- Example: Turning "Home Square Footage" into "Small," "Medium," and "Large" categories to help a tree-based model find split points faster.
- One-Hot Encoding: Converting a categorical variable into multiple binary columns.
- Example: A "Color" column with [Red, Green, Blue] becomes three columns: `is_red`, `is_green`, `is_blue` with 1s and 0s.
- Tokenization: Breaking text into individual words or symbols.
- Example: Converting the sentence "AWS is great" into the list `["AWS", "is", "great"]` for sentiment analysis.
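Two of the pairs above can be sketched directly; the square-footage thresholds are illustrative, and the tokenizer is a naive whitespace split:

```python
# Binning with pandas.cut and tokenization with str.split.
import pandas as pd

sqft = pd.Series([800, 1500, 2600, 3400])

# Convert continuous square footage into three labeled buckets
size = pd.cut(sqft, bins=[0, 1200, 2500, float("inf")],
              labels=["Small", "Medium", "Large"])

# Naive whitespace tokenization of a sentence
tokens = "AWS is great".split()
```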
Worked Examples
Example 1: Min-Max Scaling
Scenario: You have a dataset of house prices where the minimum is $100,000 and the maximum is $500,000. Normalize the price of a house that costs $200,000.
- Identify variables: $X = 200,000$, $X_{min} = 100,000$, $X_{max} = 500,000$.
- Apply Formula: $X' = \frac{200,000 - 100,000}{500,000 - 100,000} = \frac{100,000}{400,000}$.
- Result: $X' = 0.25$.
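The same calculation in code, reproducing the worked example:

```python
# Min-max scaling a single house price against the dataset's min and max.
x, x_min, x_max = 200_000, 100_000, 500_000
scaled = (x - x_min) / (x_max - x_min)  # 100,000 / 400,000 = 0.25
```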
Example 2: Handling High Cardinality
Scenario: You are predicting delivery times across 10,000 different zip codes. One-hot encoding would create 10,000 new columns, exhausting your memory.
- Solution: Use Feature Hashing. This maps the zip codes to a fixed number of "buckets" (e.g., 512) using a hash function. While collisions (two zip codes in one bucket) may occur, it drastically reduces dimensionality and compute costs.
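A minimal feature-hashing sketch in plain Python (libraries such as scikit-learn's `FeatureHasher` do this at scale; this version just shows the idea):

```python
# Map any zip code string to one of 512 stable buckets. Dimensionality is
# fixed regardless of how many distinct zip codes appear.
import hashlib

def hash_bucket(value: str, n_buckets: int = 512) -> int:
    # md5 gives a digest that is stable across processes, unlike Python's hash()
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets
```

Two different zip codes can land in the same bucket (a collision), which is the trade-off mentioned above.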
Comparison Tables
Normalization vs. Standardization
| Feature | Normalization | Standardization |
|---|---|---|
| Range | Fixed (usually 0 to 1) | Not fixed (mean 0, std 1) |
| Outlier Sensitivity | Very Sensitive | Less Sensitive |
| Use Case | Algorithms that don't assume distribution (k-NN, Neural Nets) | Algorithms that assume Gaussian distribution (Linear Regression, SVMs) |
| Scaling Tool | Min-Max Scaler | Z-Score Scaler |
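The table's outlier-sensitivity contrast can be seen with scikit-learn's two scalers on toy data containing one extreme value:

```python
# MinMaxScaler vs. StandardScaler on data with an outlier (100).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier

normalized = MinMaxScaler().fit_transform(x)      # fixed range [0, 1]
standardized = StandardScaler().fit_transform(x)  # mean 0, std 1
```

With the outlier present, the three normal values get squashed near 0 in the normalized output, while standardization keeps them better separated.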
Categorical Encoding Strategies
| Strategy | Data Type | Advantage |
|---|---|---|
| Label Encoding | Ordinal (1st, 2nd, 3rd) | Preserves order relationship. |
| One-Hot Encoding | Nominal (Red, Blue) | Prevents model from assuming a rank. |
| Binary Encoding | High Cardinality | Fewer columns than OHE (Log2 representation). |
| Hashing | Very High Cardinality | Fixed-size output, extremely memory efficient. |
Checkpoint Questions
- Which AWS service would you use to store engineered features so they can be accessed for both offline training and real-time inference?
- If your numerical data is heavily skewed and contains significant outliers, would you prefer Min-Max Scaling or Z-score Standardization?
- True or False: One-hot encoding is the preferred method for a categorical feature with 5,000 unique categories.
- Which text processing technique reduces "fishing," "fished," and "fisher" to the base word "fish"?
Answers:
- Amazon SageMaker Feature Store.
- Z-score Standardization (it is more robust to outliers).
- False (Use Binary Encoding or Hashing for high cardinality).
- Stemming (or Lemmatization if considering context).
Muddy Points & Cross-Refs
- Offline vs. Online Feature Store: The offline store is usually in S3 (parquet) for batch training. The online store is in a low-latency DB (like DynamoDB) for real-time inference. Keeping them in sync is a major challenge SageMaker Feature Store solves automatically.
- When to use Data Wrangler vs. Glue? Use Data Wrangler for exploratory, visual, and SageMaker-native ML workflows. Use AWS Glue for massive-scale Spark ETL jobs or if the data needs to be integrated into a general-purpose data warehouse (Redshift).
- Deep Dive: See "Task 1.3: Data Integrity" for how to validate that these transformations didn't introduce bias.