Unit 1 Study Guide: Data Preparation for Machine Learning
This guide covers the foundational processes of preparing, cleaning, and transforming data for machine learning, specifically focusing on the AWS ecosystem as part of the ML Engineer Associate curriculum.
Learning Objectives
After studying this guide, you should be able to:
- Identify and categorize structured, semi-structured, and unstructured data formats.
- Explain the importance of data deduplication and standardization in preventing model bias.
- Select appropriate AWS services (e.g., Glue DataBrew, SageMaker Clarify) for specific data preparation tasks.
- Apply feature engineering techniques to enhance model predictive power.
- Calculate and interpret pre-training bias metrics like Class Imbalance (CI).
Key Terms & Glossary
- Feature: An individual measurable property or characteristic of the data (e.g., "pixel value" or "average spend").
- Imputation: The technique of replacing missing data with substituted values (mean, median, or mode).
- Deduplication: The process of identifying and removing repeated entries to prevent training bias.
- Class Imbalance (CI): A state where certain labels in a dataset are represented more frequently than others, leading to skewed model results.
- Standardization: Rescaling features so they have a mean of 0 and a standard deviation of 1.
- PII (Personally Identifiable Information): Sensitive data that can identify an individual, requiring masking or anonymization for compliance.
The "Big Idea"
> [!IMPORTANT]
> Garbage In, Garbage Out (GIGO): The quality of a Machine Learning model is capped by the quality of the data used to train it. Data preparation is not just a "cleanup" phase; it is the strategic process of turning raw, noisy information into a high-signal input that allows algorithms to discover meaningful patterns.
Formula / Concept Box
| Concept | Definition / Formula | Use Case |
|---|---|---|
| Standardization (Z-score) | z = (x − μ) / σ | When features have different scales (e.g., Age vs. Income). |
| Class Imbalance (CI) | CI = (n_a − n_d) / (n_a + n_d), where n_a and n_d are the facet sizes | Measuring the gap between the majority and minority class. |
| DPL | Difference in Proportions of Labels: DPL = q_a − q_d | Comparing positive outcomes across different facets (e.g., gender). |
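These metrics can be sketched in plain Python (this is not the SageMaker Clarify API; the facet sizes and label counts below are invented for illustration):

```python
# Sketch of two pre-training bias metrics (not the SageMaker Clarify API).
# Facet sizes and label counts are invented for illustration.

def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d) for facet sizes n_a and n_d."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(pos_a: int, n_a: int, pos_d: int, n_d: int) -> float:
    """DPL = q_a - q_d: the gap in positive-label proportions."""
    return pos_a / n_a - pos_d / n_d

# 900 samples in facet a vs. 100 in facet d
print(class_imbalance(900, 100))   # 0.8 -> strong imbalance
# Facet a: 450/900 positives; facet d: 20/100 positives
print(dpl(450, 900, 20, 100))      # ~0.3 -> facet a is favored
```

A CI near 0 means the facets are balanced; values near ±1 signal that one facet dominates the dataset.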
Hierarchical Outline
- I. Data Types and Formats
- Structured Data: Predictable schema (CSV, Parquet, Databases).
- Semi-structured Data: Self-describing, flexible schema (JSON, XML).
- Unstructured Data: No predefined schema (Images, Audio, Video, Text).
- II. Data Cleaning and Integrity
- Deduplication: Using AWS Glue DataBrew for visual cleanup.
- Handling Missingness: Dropping vs. Imputing (Mean/Median/Mode).
- Data Quality: Validating integrity via AWS Glue Data Quality.
- III. Feature Engineering and Transformation
- Reformatting: Restructuring data types and harmonizing formats.
- Scaling: Using `StandardScaler` to prevent large values from dominating the model.
- New Feature Creation: Deriving insights (e.g., "Avg visits per month" from "Visit history").
- IV. Bias and Compliance
- Pre-training Bias: Detected using SageMaker Clarify.
- Mitigation: Resampling, shuffling, or synthetic data generation.
- Security: Data masking, anonymization, and encryption (KMS).
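The dropping-vs-imputing choice from section II can be sketched with pandas (the column names and values are illustrative):

```python
import pandas as pd

# Illustrative dataset with missing values
df = pd.DataFrame({'Age': [25.0, None, 35.0, 50.0],
                   'City': ['NY', 'LA', None, 'NY']})

# Option 1: drop any row containing a missing value (loses 2 of 4 rows here)
dropped = df.dropna()

# Option 2: impute -- mean for numeric columns, mode for categorical ones
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['City'] = df['City'].fillna(df['City'].mode()[0])
print(df)  # no missing values remain
```

Dropping is simplest but discards signal; imputation keeps every row at the cost of injecting assumed values, so the choice depends on how much data you can afford to lose.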
Visual Anchors
The Data Preparation Pipeline
(Original diagram not reproduced; the flow mirrors the outline above: Raw Data → Cleaning & Integrity → Feature Engineering → Bias & Compliance → Training-ready data.)
Feature Scaling Impact
This TikZ diagram visualizes how standardization centers data around a common scale, which is essential for algorithms like SVM or Linear Regression.
```latex
\begin{tikzpicture}
  % Axes
  \draw[->] (-1,0) -- (4,0) node[right] {Feature Value};
  \draw[->] (0,-1) -- (0,3) node[above] {Density};
  % Unscaled distribution (wide and shifted)
  \draw[thick, blue] (1,0) .. controls (2,2) and (3,2) .. (4,0.2);
  \node[blue] at (3.5, 1.5) {Raw Data (Skewed)};
  % Standardized distribution (centered at 0)
  \draw[thick, red] (-0.8,0) .. controls (0,2.5) and (0,2.5) .. (0.8,0);
  \node[red] at (1.2, 2.5) {Standardized (Mean=0)};
\end{tikzpicture}
```
Definition-Example Pairs
- Term: Synthetic Data Generation
  - Definition: Creating artificial data points that mimic the statistical properties of the real dataset.
  - Example: If a fraud detection dataset only has 1% "fraud" cases, you generate synthetic fraud examples to help the model learn the pattern without being overwhelmed by "non-fraud" cases.
- Term: Data Anonymization
  - Definition: The process of removing or modifying PII so that individuals cannot be identified.
  - Example: Replacing specific names in a medical dataset with generic IDs (e.g., "Patient_001") to comply with HIPAA regulations.
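A minimal pandas sketch of the anonymization example above (the column names and records are assumptions for illustration):

```python
import pandas as pd

# Hypothetical medical records containing PII in the 'name' column
records = pd.DataFrame({'name': ['Alice Smith', 'Bob Jones', 'Alice Smith'],
                        'diagnosis': ['A', 'B', 'C']})

# Map each unique name to a generic ID such as "Patient_001"
ids = {name: f'Patient_{i + 1:03d}'
       for i, name in enumerate(records['name'].unique())}
records['name'] = records['name'].map(ids)
print(records)  # same person keeps the same ID, but the PII is gone
```

Note that this is pseudonymization: keeping a stable ID per person preserves record linkage, which true anonymization would also remove.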
Worked Examples
1. Python Implementation: Standardization
Using scikit-learn to prepare a feature set (as referenced in the SageMaker workflow).
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample dataset: Age and Yearly Income
data = pd.DataFrame({'Age': [25, 45, 35, 50],
                     'Income': [50000, 120000, 80000, 150000]})

# Initialize scaler
scaler = StandardScaler()

# Fit and transform the data
standardized_data = scaler.fit_transform(data)
print(standardized_data)
# Output will be centered around 0 with unit variance
```

2. AWS Glue DataBrew: Deduplication
- Connect: Connect DataBrew to your S3 bucket.
- Transform: Select the "Remove Duplicates" transform from the visual menu.
- Scope: Choose to remove duplicates based on a specific unique key (e.g., `customer_id`) or the entire row.
- Job: Run the DataBrew job to output the cleaned CSV back to S3 for training.
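For comparison, the two deduplication scopes above can be sketched in pandas (`customer_id` mirrors the key in the DataBrew steps; the rows are made up):

```python
import pandas as pd

# Toy orders table; customer 101 appears twice with different spend
orders = pd.DataFrame({'customer_id': [101, 102, 101, 103],
                       'spend': [50, 75, 60, 20]})

# Entire-row scope: only rows identical in every column are duplicates
no_exact_dupes = orders.drop_duplicates()

# Key-based scope: keep the first row per customer_id
no_key_dupes = orders.drop_duplicates(subset=['customer_id'])

print(len(no_exact_dupes), len(no_key_dupes))  # 4 3
```

The two scopes can give different results, as here: no row is an exact duplicate, but customer 101 is a key-based duplicate.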
Checkpoint Questions
- Which AWS service provides a visual interface for data cleaning without requiring code?
- Answer: AWS Glue DataBrew.
- What is the primary risk of having duplicate entries in a training dataset?
- Answer: It introduces bias by over-representing specific entries, potentially leading to an imbalanced model.
- True or False: Feature engineering involves adding new external data to your dataset.
- Answer: False. It involves extracting more information from existing data.
- What metric would you use to measure the difference in positive labels between two groups?
- Answer: Difference in Proportions of Labels (DPL).
Muddy Points & Cross-Refs
- Standardization vs. Normalization:
- Standardization (Z-score) scales data based on the mean and standard deviation. Use this for algorithms that assume a Gaussian distribution (e.g., Linear Regression).
- Normalization (Min-Max) scales data to a fixed range, usually [0, 1]. Use this when the data has known bounds (e.g., image pixel values in [0, 255]).
- Bias Detection: While SageMaker Clarify is the primary tool for detecting bias, Glue Data Quality is better for technical integrity checks (e.g., "Is this column null?").
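The standardization-vs-normalization contrast can be sketched with scikit-learn on the same toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[25.0], [45.0], [35.0], [50.0]])  # e.g., Age

z = StandardScaler().fit_transform(x)  # mean 0, standard deviation 1
m = MinMaxScaler().fit_transform(x)    # rescaled into [0, 1]

print(m.ravel().tolist())  # [0.0, 0.8, 0.4, 1.0]
```

After `StandardScaler`, values fall on an unbounded z-score scale centered at 0; after `MinMaxScaler`, the smallest value maps to exactly 0 and the largest to exactly 1.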
Comparison Tables
Data Preparation Tools Comparison
| Tool | Primary User | Best For... | Coding Required? |
|---|---|---|---|
| SageMaker Notebooks | Data Scientists | Custom, complex transformations (Pandas/PySpark). | Yes (Python/R) |
| AWS Glue DataBrew | ML Engineers / Analysts | Fast, visual cleaning and deduplication. | No (Visual) |
| AWS Glue (ETL) | Data Engineers | Large-scale production data pipelines. | Yes (Spark/Python) |
Structured vs. Unstructured Data
| Feature | Structured | Unstructured |
|---|---|---|
| Schema | Rigid, Pre-defined | None / Internal only |
| Examples | SQL Tables, CSV | JPG, MP4, PDF, Text |
| ML Approach | Tabular Algorithms (XGBoost) | Deep Learning (CNNs, LLMs) |
| Searchability | Easy via queries | Difficult (requires indexing/embedding) |