
Feature Engineering Techniques: Scaling, Transformation, and Encoding

Feature engineering techniques (for example, data scaling and standardization, feature splitting, binning, log transformation, normalization)

This study guide covers essential feature engineering techniques required for the AWS Certified Machine Learning Engineer Associate exam, focusing on how to transform raw data into high-quality features that improve model convergence and accuracy.

Learning Objectives

  • Differentiate between Normalization and Standardization and identify when to use each.
  • Apply Log Transformations to handle skewed data and outliers.
  • Explain the processes of Binning, Feature Splitting, and Feature Combining.
  • Select appropriate Encoding techniques (One-Hot, Label, Binary, Hashing) for categorical data.
  • Identify AWS services like SageMaker Data Wrangler and SageMaker Feature Store used for these transformations.

Key Terms & Glossary

  • Feature Engineering: The process of using domain knowledge to extract features from raw data via data mining techniques.
  • Normalization (Min-Max Scaling): Rescaling feature values into a fixed range, typically [0, 1] or [-1, 1].
  • Standardization (Z-score Normalization): Transforming data to have a mean of 0 and a standard deviation of 1.
  • High Cardinality: A property of a dataset where a column has a very large number of unique values (e.g., User IDs).
  • One-Hot Encoding: Converting a categorical variable into a set of binary (0/1) columns, one per category, so ML algorithms do not infer a false ordering.

The "Big Idea"

Feature engineering is the most critical step in the Machine Learning pipeline. While algorithms are powerful, they are only as good as the data they consume. Raw data is often messy, on different scales, and contains non-linear relationships. Feature engineering "distills" the raw data, making the underlying patterns more apparent to the model, which leads to faster convergence (training speed) and better generalization (accuracy on new data).

Formula / Concept Box

| Technique | Mathematical Formula | Key Use Case |
|---|---|---|
| Min-Max Scaling | $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ | Algorithms where scale matters (k-NN, Neural Networks) |
| Z-Score (Standardization) | $z = \frac{x - \mu}{\sigma}$ | Algorithms assuming Gaussian distribution (Linear/Logistic Regression) |
| Log Transformation | $y = \log(x)$ | Fixing right-skewed data and dampening the effect of outliers |
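As a quick sketch, the first two formulas can be applied directly with NumPy (the sample values below are illustrative, not from the exam):

```python
import numpy as np

# Illustrative feature values
x = np.array([20.0, 35.0, 50.0, 80.0, 120.0])

# Min-Max scaling: x' = (x - min) / (max - min)  ->  values in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: z = (x - mu) / sigma  ->  mean 0, std dev 1
x_z = (x - x.mean()) / x.std()

print(x_minmax)               # smallest value maps to 0.0, largest to 1.0
print(x_z.mean(), x_z.std())  # approximately 0.0 and 1.0
```

In practice you would fit the min/max (or mean/std) on the training set only and reuse those statistics at inference time, to avoid data leakage.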

Hierarchical Outline

  • I. Numerical Feature Engineering
    • A. Scaling Techniques
      • Normalization: Scales to [0,1]; sensitive to outliers.
      • Standardization: Centers on mean 0; less sensitive to outliers than Min-Max (though outliers still shift the mean and standard deviation).
    • B. Distribution Adjustments
      • Log Transformation: Compresses the range of large values; handles "long tails."
    • C. Discretization
      • Binning: Converting continuous variables into categorical buckets (e.g., Age to Age-Groups).
  • II. Categorical Feature Engineering
    • A. Nominal Encoding: One-Hot Encoding (low cardinality), Hashing (high cardinality).
    • B. Ordinal Encoding: Label Encoding (where order matters, e.g., Small < Medium < Large).
  • III. Feature Derivation
    • A. Feature Splitting: Breaking a date into Day/Month/Year.
    • B. Feature Combining: Multiplying Height and Width to get Area.
  • IV. AWS Tooling
    • SageMaker Data Wrangler: GUI-based feature engineering.
    • SageMaker Feature Store: Centralized repository to store and share features.

Visual Anchors

Scaling Decision Flowchart


Visualization of Standardization


Definition-Example Pairs

  • Binning: Converting continuous values into groups. Example: Grouping house ages into "New" (<5 years), "Modern" (5-20 years), and "Historic" (>20 years).
  • Feature Splitting: Breaking one feature into multiple. Example: Splitting a "Timestamp" column into "Is_Weekend" (Boolean) and "Hour_of_Day" (Integer).
  • Log Transformation: Applying a logarithm to values. Example: In a dataset of individual wealth, the few billionaires create a massive skew; log-transforming wealth makes the distribution look more normal/bell-shaped.
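The three definition-example pairs above can be sketched in pandas; the column names and values here are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset
df = pd.DataFrame({
    "house_age": [2, 12, 45, 7, 30],
    "timestamp": pd.to_datetime(
        ["2024-01-06 09:00", "2024-01-08 17:30", "2024-01-13 23:15",
         "2024-01-10 08:45", "2024-01-14 12:00"]),
    "wealth": [1e4, 5e4, 2e5, 1e6, 1e9],
})

# Binning: continuous house age -> categorical buckets
df["age_group"] = pd.cut(df["house_age"], bins=[-np.inf, 5, 20, np.inf],
                         labels=["New", "Modern", "Historic"])

# Feature splitting: one timestamp -> Is_Weekend and Hour_of_Day
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
df["hour_of_day"] = df["timestamp"].dt.hour

# Log transformation: compress the long right tail of wealth
df["log_wealth"] = np.log10(df["wealth"])
```

Note that `pd.cut` uses right-closed intervals by default, so the "New"/"Modern" boundary falls exactly at age 5.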

Worked Examples

Example 1: Min-Max Scaling

Scenario: You have a feature "Income" with a minimum of $20,000 and a maximum of $120,000. You need to normalize a value of $70,000.

Step-by-Step:

  1. Identify values: $x = 70{,}000$, $\min = 20{,}000$, $\max = 120{,}000$.
  2. Apply formula: $x' = (70{,}000 - 20{,}000) / (120{,}000 - 20{,}000)$.
  3. Calculate: $50{,}000 / 100{,}000 = 0.5$.
  4. Result: The normalized value is $0.5$.
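The arithmetic can be checked in a couple of lines of Python:

```python
# Min-Max scaling of a single value (numbers from the scenario above)
x, x_min, x_max = 70_000, 20_000, 120_000
x_scaled = (x - x_min) / (x_max - x_min)
print(x_scaled)  # 0.5
```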

Example 2: Log Transformation for Skew

Scenario: A dataset of video view counts is highly right-skewed (most videos have 10 views, a few have 10,000,000). Solution: Apply $y = \log_{10}(x)$. A video with 10 views becomes 1, and a video with 10,000,000 views becomes 7. This reduces the "distance" between the outliers and the majority of the data, helping the model learn more effectively.
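The same arithmetic with Python's standard library:

```python
import math

# log10 maps the skewed view counts onto a compact scale
print(math.log10(10))          # 1.0
print(math.log10(10_000_000))  # 7.0
```

In practice, `log(x + 1)` (e.g. `numpy.log1p`) is often used instead, since raw `log` is undefined at zero views.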

Checkpoint Questions

  1. Which scaling technique should you use if your data contains significant outliers that you want to maintain but dampen?
  2. Why is One-Hot Encoding preferred over Label Encoding for the feature "Color" (Red, Blue, Green) in a Linear Regression model?
  3. Which AWS service provides a visual interface for performing 300+ built-in data transformations without writing code?
  4. When should you use Feature Hashing instead of One-Hot Encoding?
Answers
  1. Standardization (Z-score) or Log Transformation, as Min-Max scaling would squash the non-outliers into a very tiny range.
  2. Because Label Encoding (1, 2, 3) implies a mathematical order (Blue is 'greater' than Red), which is not true for colors. One-hot encoding treats them as independent categories.
  3. Amazon SageMaker Data Wrangler.
  4. When you have high cardinality (thousands of unique categories) and need to limit the number of resulting columns to save compute/memory.

Muddy Points & Cross-Refs

  • Normalization vs. Standardization: A common confusion point. Remember: Normalization = New Range (0 to 1). Standardization = Shape (Gaussian/Bell curve).
  • Deep Learning: Neural networks almost always require input scaling (usually Normalization) to prevent "vanishing/exploding gradients."
  • SageMaker Feature Store: Don't confuse this with a database. It is a specialized store that manages the metadata and versions of features for both training (offline) and inference (online).

Comparison Tables

Encoding Comparison

| Method | Best For | Pros | Cons |
|---|---|---|---|
| One-Hot | Low cardinality nominal data | Simple, preserves all info | Causes "Dimensionality Explosion" |
| Label | Ordinal data (Small, Med, Large) | Low memory footprint | Implies false ranking for nominal data |
| Binary | Medium cardinality | Fewer columns than One-Hot | Slightly less interpretable |
| Hashing | High cardinality | Fixed memory usage | Potential "collisions" (two labels get same hash) |
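A minimal sketch of three of these encodings in pandas; the color/size/user-ID columns are invented for illustration, and `zlib.crc32` stands in here for whatever hash function a real feature hasher would use:

```python
import zlib
import pandas as pd

# One-Hot: one binary column per category (nominal, low cardinality)
colors = pd.Series(["Red", "Blue", "Green", "Blue"])
one_hot = pd.get_dummies(colors, prefix="color")  # color_Blue, color_Green, color_Red

# Label encoding: only safe when the categories have a real order
sizes = pd.Series(["Small", "Large", "Medium"])
size_order = {"Small": 0, "Medium": 1, "Large": 2}
size_encoded = sizes.map(size_order)

# Hashing: map a high-cardinality value into a fixed number of buckets;
# two different IDs may collide in the same bucket
n_buckets = 8
user_ids = pd.Series(["user_12345", "user_99999"])
hashed = user_ids.apply(lambda s: zlib.crc32(s.encode()) % n_buckets)
```

Note how the hashing step produces a fixed number of output values no matter how many unique IDs appear, which is exactly the memory trade-off in the table above.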

Normalization vs. Standardization

| Feature | Normalization (Min-Max) | Standardization (Z-Score) |
|---|---|---|
| Resulting Range | Typically [0, 1] | Not bounded (mean 0, std dev 1) |
| Outlier Sensitivity | Very high | Lower (outliers still shift the mean and std dev) |
| Distribution Shape | Unchanged (linear rescaling) | Unchanged (linear rescaling) |
| Model Preference | k-NN, Neural Networks | Linear/Logistic Regression, SVM |
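The outlier-sensitivity row can be demonstrated with NumPy (income values are illustrative):

```python
import numpy as np

# Four typical incomes plus one extreme outlier
x = np.array([20_000.0, 40_000.0, 60_000.0, 80_000.0, 1_000_000.0])

minmax = (x - x.min()) / (x.max() - x.min())
zscore = (x - x.mean()) / x.std()

# Min-Max squashes the four typical incomes into a tiny slice of [0, 1]
print(minmax[:4])  # all below ~0.07
# Z-score keeps them distinguishable (roughly -0.6 to -0.4 here)
print(zscore[:4])
```

This is the scenario behind checkpoint answer 1: with a large outlier present, Min-Max leaves almost no spread among the non-outliers.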
