Data Cleaning and Transformation: The MLA-C01 Essentials
Data cleaning and transformation techniques (for example, detecting and treating outliers, imputing missing data, combining, deduplication)
This study guide covers the critical processes of preparing raw data for Machine Learning models, focusing on the techniques and AWS tools required for the AWS Certified Machine Learning Engineer - Associate (MLA-C01) exam.
Learning Objectives
After studying this guide, you should be able to:
- Identify and treat outliers using Z-score and IQR methods.
- Apply imputation strategies to handle missing data points.
- Perform deduplication to prevent model overfitting and skew.
- Utilize AWS Glue DataBrew and SageMaker Data Wrangler for automated cleaning.
- Understand the impact of data quality on model performance and reliability.
Key Terms & Glossary
- Imputation: The process of replacing missing data with substituted values (e.g., mean, median, mode).
- Deduplication: Identifying and removing identical or near-identical records from a dataset.
- Outlier: A data point that significantly deviates from the rest of the observations.
- Overfitting: A modeling error that occurs when a function is too closely fit to a limited set of data points, often caused by duplicate data.
- Skewness: The measure of the asymmetry of the probability distribution of a real-valued random variable.
The "Big Idea"
[!IMPORTANT] Garbage In, Garbage Out (GIGO): The performance of even the most sophisticated Machine Learning model is capped by the quality of its training data. Data cleaning isn't just "housekeeping"; it is a core engineering task that ensures models learn the underlying signal rather than the noise (random errors or duplicates).
Formula / Concept Box
| Method | Formula / Rule | Use Case |
|---|---|---|
| Z-Score | $Z = \frac{x - \mu}{\sigma}$; outlier if $\lvert Z \rvert > 3$ | Best for normally distributed data. |
| IQR (Interquartile Range) | $IQR = Q_3 - Q_1$; outlier if $x < Q_1 - 1.5 \cdot IQR$ or $x > Q_3 + 1.5 \cdot IQR$ | Best for skewed data. |
| Mean Imputation | Replace missing values with the column mean $\bar{x}$ | Numerical data without significant outliers. |
| Median Imputation | Replace missing values with the column median (middle value) | Numerical data with significant outliers/skew. |
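The two detection rules in the box above can be compared directly in Python. This is a minimal NumPy sketch with made-up sample values; it also demonstrates why the choice of rule matters:

```python
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([10, 12, 11, 13, 12, 11, 95])

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# The extreme value inflates the mean and std, so the Z-score rule
# misses it here (|Z| is only about 2.45), while the IQR rule flags it
print(z_outliers)    # []
print(iqr_outliers)  # [95]
```

Note how the single extreme point distorts the very statistics the Z-score depends on; this is the same effect described in the checkpoint answer on skewed data below.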
Hierarchical Outline
- I. Handling Missing Data
- Dropping Data: Removing rows (samples) or columns (features) when data is missing.
- Simple Imputation: Replacing `null` with Mean, Median, or Most Frequent value.
- AWS Implementation: Using `SimpleImputer` in Scikit-Learn or built-in transformations in Glue DataBrew.
- II. Outlier Detection & Treatment
- Natural Outliers: Valid extreme variations (e.g., a professional athlete's speed).
- Artificial Outliers: Errors in collection (e.g., a human age of 250).
- Treatment: Rescaling, removing, or flagging for the model.
- III. Deduplication
- Purpose: Prevents the model from weighing specific samples too heavily (overfitting).
- Techniques: Exact match vs. Fuzzy matching.
- IV. Data Combination & Merging
- Sources: Combining data from S3, RDS, and Redshift using AWS Glue or Spark.
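The combination step in the outline above can be sketched locally with pandas. This is a minimal illustration; in practice the joins would run in AWS Glue or Spark at scale, and the table and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical extracts, e.g. orders pulled from S3 and profiles from RDS
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 75, 20]})
profiles = pd.DataFrame({"customer_id": [1, 2, 4], "segment": ["A", "B", "A"]})

# Left join keeps every order; customers without a profile get NaN,
# which then feeds back into the missing-data strategies above
combined = orders.merge(profiles, on="customer_id", how="left")
```

Merging is rarely the end of cleaning: the join itself can introduce missing values (unmatched keys) and duplicates (many-to-many keys), so it usually precedes imputation and deduplication in a pipeline.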
Visual Anchors
Missing Data Strategy Flowchart
Outlier Detection (Z-Score vs. IQR)
(Figure: a normal distribution curve with a dashed line at the 3σ threshold and a single red point beyond it labeled "Outlier".)
Definition-Example Pairs
- Natural Outlier
- Definition: A valid data point that represents an extreme but real variation in the population.
- Example: In a dataset of household incomes, Jeff Bezos's income is a natural outlier—it is real but extremely far from the average.
- Artificial Outlier
- Definition: An error caused by faulty data collection, entry mistakes, or sensor malfunctions.
- Example: A temperature sensor in a room recording 500 degrees Celsius due to a short circuit.
- Deduplication
- Definition: The removal of duplicate entries to ensure each unique entity is represented only once.
- Example: A retail database with two entries for "John Doe, 123 Maple St" due to two different sign-up forms.
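The retail example above maps directly to an exact-match pass in pandas. This is a minimal sketch; fuzzy matching of near-identical records (e.g. "123 Maple St." vs. "123 Maple Street") would need additional normalization or a string-similarity library:

```python
import pandas as pd

# Two sign-up forms produced the same customer twice
customers = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Jane Roe"],
    "address": ["123 Maple St", "123 Maple St", "9 Oak Ave"],
})

# Exact-match deduplication: keep the first of each identical record
deduped = customers.drop_duplicates(subset=["name", "address"], keep="first")
print(len(deduped))  # 2
```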
Worked Examples
Example 1: Python Imputation using Scikit-Learn
Suppose you have a dataset with missing values. Here is how to use `SimpleImputer` to fill them with the mean strategy (missing entries are represented as `np.nan`, which Scikit-Learn expects by default):

```python
from sklearn.impute import SimpleImputer
import numpy as np

# Sample dataset with missing values (np.nan)
X = np.array([[1, 2], [np.nan, 4], [5, np.nan]])

# Create imputer object with 'mean' strategy (column-wise mean)
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
cleaned_data = imputer.fit_transform(X)

# Result:
# [[1. 2.]
#  [3. 4.]  <-- 3 is mean of 1 and 5
#  [5. 3.]] <-- 3 is mean of 2 and 4
```

Checkpoint Questions
- Why is Z-score unreliable for detecting outliers in a right-skewed dataset?
- Under what condition should you drop an entire feature column due to missing data?
- How does duplicate data lead to model overfitting?
- Which AWS service offers 250+ pre-built transformations for data cleaning without writing code?
Answers
- Z-score assumes a normal (Gaussian) distribution; in skewed data, the mean and standard deviation are themselves pulled by the skew, masking outliers.
- When the amount of missing data is substantial (e.g., >40-50%) and the feature is not critical for the target prediction.
- The model "memorizes" the repeated patterns of the duplicate data, making it perform well on training data but poorly on unseen new data.
- AWS Glue DataBrew.
Muddy Points & Cross-Refs
- Z-score vs. IQR: Students often struggle with when to use which. Use IQR for non-normal or skewed data (like income or house prices). Use Z-score for data that follows a bell curve (like heights or standardized test scores).
- Dropping vs. Imputing: If you have 1 million rows and only 5 are missing data, dropping is fine. If you have 100 rows and 5 are missing, imputation is necessary to preserve the small amount of data you have.
- Deep Study Pointers: Check out the documentation for Amazon SageMaker Data Wrangler for visual data flow orchestration.
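The dropping-vs-imputing trade-off described above can be sketched with pandas (a minimal illustration with made-up values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, 33], "city": ["NY", "LA", None, "SF"]})

# Dropping: acceptable when only a tiny fraction of rows is affected
dropped = df.dropna()  # loses 2 of 4 rows here

# Imputing: preserves every row, which matters for small datasets;
# median is used because it is robust to outliers
imputed = df.assign(age=df["age"].fillna(df["age"].median()))
```

On this tiny frame, dropping discards half the data, while median imputation keeps all four rows; with a million rows and five missing, the balance flips.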
Comparison Tables
Handling Missing Data
| Strategy | Pros | Cons |
|---|---|---|
| Dropping Rows | Quick, easy, preserves data integrity for remaining rows. | Reduces dataset size; can introduce bias if missingness isn't random. |
| Mean Imputation | Keeps sample size large; simple to implement. | Reduces variance; doesn't account for relationships between features. |
| Mode Imputation | Works for categorical data (e.g., "Red", "Blue"). | Can create a massive imbalance in categorical features. |
AWS Tool Comparison
| Feature | AWS Glue DataBrew | SageMaker Data Wrangler |
|---|---|---|
| User Interface | Visual, point-and-click. | Integrated into SageMaker Studio. |
| Target User | Data Analysts / Data Scientists. | ML Engineers / Data Scientists. |
| Customization | Python UDFs supported. | Deeply integrated with SageMaker Pipelines. |
| Core Strength | Quality reports & 250+ transforms. | End-to-end ML feature engineering. |