
Feature Engineering Techniques: Scaling, Transformation, and Encoding

Feature engineering techniques (for example, data scaling and standardization, feature splitting, binning, log transformation, normalization)

This study guide covers essential feature engineering techniques required for the AWS Certified Machine Learning Engineer Associate exam, focusing on how to transform raw data into high-quality features that improve model convergence and accuracy.

Learning Objectives

  • Differentiate between Normalization and Standardization and identify when to use each.
  • Apply Log Transformations to handle skewed data and outliers.
  • Explain the processes of Binning, Feature Splitting, and Feature Combining.
  • Select appropriate Encoding techniques (One-Hot, Label, Binary, Hashing) for categorical data.
  • Identify AWS services like SageMaker Data Wrangler and SageMaker Feature Store used for these transformations.

Key Terms & Glossary

  • Feature Engineering: The process of using domain knowledge to extract features from raw data via data mining techniques.
  • Normalization (Min-Max Scaling): Rescaling feature values into a fixed range, typically [0, 1] or [-1, 1].
  • Standardization (Z-score Normalization): Transforming data to have a mean of 0 and a standard deviation of 1.
  • High Cardinality: A property of a dataset where a column has a very large number of unique values (e.g., User IDs).
  • One-Hot Encoding: Converting a categorical variable into a set of binary (0/1) columns, one per category, so ML algorithms do not infer a false ordering.

The "Big Idea"

Feature engineering is the most critical step in the Machine Learning pipeline. While algorithms are powerful, they are only as good as the data they consume. Raw data is often messy, on different scales, and contains non-linear relationships. Feature engineering "distills" the raw data, making the underlying patterns more apparent to the model, which leads to faster convergence (training speed) and better generalization (accuracy on new data).

Formula / Concept Box

| Technique | Mathematical Formula | Key Use Case |
|---|---|---|
| Min-Max Scaling | $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ | Algorithms where scale matters (k-NN, Neural Networks) |
| Z-Score (Standardization) | $z = \frac{x - \mu}{\sigma}$ | Algorithms assuming Gaussian distribution (Linear/Logistic Regression) |
| Log Transformation | $y = \log(x)$ | Fixing right-skewed data and dampening the effect of outliers |
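As a quick sketch, the first two formulas can be applied directly with NumPy (the sample values below are illustrative, not from the exam):

```python
import numpy as np

# Illustrative feature values
x = np.array([20.0, 35.0, 50.0, 80.0, 120.0])

# Min-Max scaling: x' = (x - min) / (max - min)  ->  values in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: z = (x - mu) / sigma  ->  mean 0, std dev 1
x_z = (x - x.mean()) / x.std()

print(x_minmax)               # smallest value maps to 0.0, largest to 1.0
print(x_z.mean(), x_z.std())  # approximately 0.0 and 1.0
```

In practice you would fit the min/max (or mean/std) on the training set only and reuse those statistics at inference time, to avoid data leakage.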

Hierarchical Outline

  • I. Numerical Feature Engineering
    • A. Scaling Techniques
      • Normalization: Scales to [0,1]; sensitive to outliers.
      • Standardization: Centers on mean 0; less sensitive to outliers than Min-Max (though outliers still shift the mean and standard deviation).
    • B. Distribution Adjustments
      • Log Transformation: Compresses the range of large values; handles "long tails."
    • C. Discretization
      • Binning: Converting continuous variables into categorical buckets (e.g., Age to Age-Groups).
  • II. Categorical Feature Engineering
    • A. Nominal Encoding: One-Hot Encoding (low cardinality), Hashing (high cardinality).
    • B. Ordinal Encoding: Label Encoding (where order matters, e.g., Small < Medium < Large).
  • III. Feature Derivation
    • A. Feature Splitting: Breaking a date into Day/Month/Year.
    • B. Feature Combining: Multiplying Height and Width to get Area.
  • IV. AWS Tooling
    • SageMaker Data Wrangler: GUI-based feature engineering.
    • SageMaker Feature Store: Centralized repository to store and share features.

Visual Anchors

Scaling Decision Flowchart


Visualization of Standardization


Definition-Example Pairs

  • Binning: Converting continuous values into groups. Example: Grouping house ages into "New" (<5 years), "Modern" (5-20 years), and "Historic" (>20 years).
  • Feature Splitting: Breaking one feature into multiple. Example: Splitting a "Timestamp" column into "Is_Weekend" (Boolean) and "Hour_of_Day" (Integer).
  • Log Transformation: Applying a logarithm to values. Example: In a dataset of individual wealth, the few billionaires create a massive skew; log-transforming wealth makes the distribution look more normal/bell-shaped.
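The three definition-example pairs above can be sketched in pandas; the column names and values here are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset
df = pd.DataFrame({
    "house_age": [2, 12, 45, 7, 30],
    "timestamp": pd.to_datetime(
        ["2024-01-06 09:00", "2024-01-08 17:30", "2024-01-13 23:15",
         "2024-01-10 08:45", "2024-01-14 12:00"]),
    "wealth": [1e4, 5e4, 2e5, 1e6, 1e9],
})

# Binning: continuous house age -> categorical buckets
df["age_group"] = pd.cut(df["house_age"], bins=[-np.inf, 5, 20, np.inf],
                         labels=["New", "Modern", "Historic"])

# Feature splitting: one timestamp -> Is_Weekend and Hour_of_Day
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
df["hour_of_day"] = df["timestamp"].dt.hour

# Log transformation: compress the long right tail of wealth
df["log_wealth"] = np.log10(df["wealth"])
```

Note that `pd.cut` uses right-closed intervals by default, so the "New"/"Modern" boundary falls exactly at age 5.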

Worked Examples

Example 1: Min-Max Scaling

Scenario: You have a feature "Income" with a minimum of $20,000 and a maximum of $120,000. You need to normalize a value of $70,000.

Step-by-Step:

  1. Identify values: $x = 70{,}000$, $\min = 20{,}000$, $\max = 120{,}000$.
  2. Apply formula: $x' = (70{,}000 - 20{,}000) / (120{,}000 - 20{,}000)$.
  3. Calculate: $50{,}000 / 100{,}000 = 0.5$.
  4. Result: The normalized value is $0.5$.
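The arithmetic can be checked in a couple of lines of Python:

```python
# Min-Max scaling of a single value (numbers from the scenario above)
x, x_min, x_max = 70_000, 20_000, 120_000
x_scaled = (x - x_min) / (x_max - x_min)
print(x_scaled)  # 0.5
```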

Example 2: Log Transformation for Skew

Scenario: A dataset of video view counts is highly right-skewed (most videos have 10 views, a few have 10,000,000). Solution: Apply $y = \log_{10}(x)$. A video with 10 views becomes 1, and a video with 10,000,000 views becomes 7. This reduces the "distance" between the outliers and the majority of the data, helping the model learn more effectively.
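The same arithmetic with Python's standard library:

```python
import math

# log10 maps the skewed view counts onto a compact scale
print(math.log10(10))          # 1.0
print(math.log10(10_000_000))  # 7.0
```

In practice, `log(x + 1)` (e.g. `numpy.log1p`) is often used instead, since raw `log` is undefined at zero views.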

Checkpoint Questions

  1. Which scaling technique should you use if your data contains significant outliers that you want to maintain but dampen?
  2. Why is One-Hot Encoding preferred over Label Encoding for the feature "Color" (Red, Blue, Green) in a Linear Regression model?
  3. Which AWS service provides a visual interface for performing 300+ built-in data transformations without writing code?
  4. When should you use Feature Hashing instead of One-Hot Encoding?
Answers
  1. Standardization (Z-score) or Log Transformation, as Min-Max scaling would squash the non-outliers into a very tiny range.
  2. Because Label Encoding (1, 2, 3) implies a mathematical order (Blue is 'greater' than Red), which is not true for colors. One-hot encoding treats them as independent categories.
  3. Amazon SageMaker Data Wrangler.
  4. When you have high cardinality (thousands of unique categories) and need to limit the number of resulting columns to save compute/memory.

Muddy Points & Cross-Refs

  • Normalization vs. Standardization: A common confusion point. Remember: Normalization = New Range (0 to 1). Standardization = Shape (Gaussian/Bell curve).
  • Deep Learning: Neural networks almost always require input scaling (usually Normalization) to prevent "vanishing/exploding gradients."
  • SageMaker Feature Store: Don't confuse this with a database. It is a specialized store that manages the metadata and versions of features for both training (offline) and inference (online).

Comparison Tables

Encoding Comparison

| Method | Best For | Pros | Cons |
|---|---|---|---|
| One-Hot | Low cardinality nominal data | Simple, preserves all info | Causes "Dimensionality Explosion" |
| Label | Ordinal data (Small, Med, Large) | Low memory footprint | Implies false ranking for nominal data |
| Binary | Medium cardinality | Fewer columns than One-Hot | Slightly less interpretable |
| Hashing | High cardinality | Fixed memory usage | Potential "collisions" (two labels get same hash) |
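A minimal sketch of three of these encodings in pandas; the color/size/user-ID columns are invented for illustration, and `zlib.crc32` stands in here for whatever hash function a real feature hasher would use:

```python
import zlib
import pandas as pd

# One-Hot: one binary column per category (nominal, low cardinality)
colors = pd.Series(["Red", "Blue", "Green", "Blue"])
one_hot = pd.get_dummies(colors, prefix="color")  # color_Blue, color_Green, color_Red

# Label encoding: only safe when the categories have a real order
sizes = pd.Series(["Small", "Large", "Medium"])
size_order = {"Small": 0, "Medium": 1, "Large": 2}
size_encoded = sizes.map(size_order)

# Hashing: map a high-cardinality value into a fixed number of buckets;
# two different IDs may collide in the same bucket
n_buckets = 8
user_ids = pd.Series(["user_12345", "user_99999"])
hashed = user_ids.apply(lambda s: zlib.crc32(s.encode()) % n_buckets)
```

Note how the hashing step produces a fixed number of output values no matter how many unique IDs appear, which is exactly the memory trade-off in the table above.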

Normalization vs. Standardization

| Feature | Normalization (Min-Max) | Standardization (Z-Score) |
|---|---|---|
| Resulting Range | Typically [0, 1] | Not bounded (mean 0, std dev 1) |
| Outlier Sensitivity | Very high | Lower (outliers still shift the mean and std dev) |
| Distribution Shape | Unchanged (linear rescaling) | Unchanged (linear rescaling) |
| Model Preference | k-NN, Neural Networks | Linear/Logistic Regression, SVM |
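The outlier-sensitivity row can be demonstrated with NumPy (income values are illustrative):

```python
import numpy as np

# Four typical incomes plus one extreme outlier
x = np.array([20_000.0, 40_000.0, 60_000.0, 80_000.0, 1_000_000.0])

minmax = (x - x.min()) / (x.max() - x.min())
zscore = (x - x.mean()) / x.std()

# Min-Max squashes the four typical incomes into a tiny slice of [0, 1]
print(minmax[:4])  # all below ~0.07
# Z-score keeps them distinguishable (roughly -0.6 to -0.4 here)
print(zscore[:4])
```

This is the scenario behind checkpoint answer 1: with a large outlier present, Min-Max leaves almost no spread among the non-outliers.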
