Unit 1 Study Guide: Data Preparation for Machine Learning
This guide covers the foundational processes of preparing, cleaning, and transforming data for machine learning, specifically focusing on the AWS ecosystem as part of the ML Engineer Associate curriculum.
Learning Objectives
After studying this guide, you should be able to:
- Identify and categorize structured, semi-structured, and unstructured data formats.
- Explain the importance of data deduplication and standardization in preventing model bias.
- Select appropriate AWS services (e.g., Glue DataBrew, SageMaker Clarify) for specific data preparation tasks.
- Apply feature engineering techniques to enhance model predictive power.
- Calculate and interpret pre-training bias metrics like Class Imbalance (CI).
Key Terms & Glossary
- Feature: An individual measurable property or characteristic of the data (e.g., "pixel value" or "average spend").
- Imputation: The technique of replacing missing data with substituted values (mean, median, or mode).
- Deduplication: The process of identifying and removing repeated entries to prevent training bias.
- Class Imbalance (CI): A state where certain labels in a dataset are represented more frequently than others, leading to skewed model results.
- Standardization: Rescaling features so they have a mean of 0 and a standard deviation of 1.
- PII (Personally Identifiable Information): Sensitive data that can identify an individual, requiring masking or anonymization for compliance.
The "Big Idea"
> [!IMPORTANT]
> Garbage In, Garbage Out (GIGO): The quality of a Machine Learning model is capped by the quality of the data used to train it. Data preparation is not just a "cleanup" phase; it is the strategic process of turning raw, noisy information into a high-signal input that allows algorithms to discover meaningful patterns.
Formula / Concept Box
| Concept | Definition / Formula | Use Case |
|---|---|---|
| Standardization (Z-score) | z = (x − μ) / σ | When features have different scales (e.g., Age vs. Income). |
| Class Imbalance (CI) | CI = (n_a − n_d) / (n_a + n_d), where n_a and n_d are the facet sizes | Measuring the gap between the majority and minority class. |
| DPL | Difference in Proportions of Labels: DPL = q_a − q_d | Comparing positive outcomes across different facets (e.g., gender). |
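These metrics can be sketched in plain Python (this is not the SageMaker Clarify API; the facet sizes and label counts below are invented for illustration):

```python
# Sketch of two pre-training bias metrics (not the SageMaker Clarify API).
# Facet sizes and label counts are invented for illustration.

def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d) for facet sizes n_a and n_d."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(pos_a: int, n_a: int, pos_d: int, n_d: int) -> float:
    """DPL = q_a - q_d: the gap in positive-label proportions."""
    return pos_a / n_a - pos_d / n_d

# 900 samples in facet a vs. 100 in facet d
print(class_imbalance(900, 100))   # 0.8 -> strong imbalance
# Facet a: 450/900 positives; facet d: 20/100 positives
print(dpl(450, 900, 20, 100))      # ~0.3 -> facet a is favored
```

A CI near 0 means the facets are balanced; values near ±1 signal that one facet dominates the dataset.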
Hierarchical Outline
- I. Data Types and Formats
- Structured Data: Predictable schema (CSV, Parquet, Databases).
- Semi-structured Data: Self-describing, flexible schema (JSON, XML).
- Unstructured Data: No predefined schema (Images, Audio, Video, Text).
- II. Data Cleaning and Integrity
- Deduplication: Using AWS Glue DataBrew for visual cleanup.
- Handling Missingness: Dropping vs. Imputing (Mean/Median/Mode).
- Data Quality: Validating integrity via AWS Glue Data Quality.
- III. Feature Engineering and Transformation
- Reformatting: Restructuring data types and harmonizing formats.
- Scaling: Using `StandardScaler` to prevent large values from dominating the model.
- New Feature Creation: Deriving insights (e.g., "Avg visits per month" from "Visit history").
- IV. Bias and Compliance
- Pre-training Bias: Detected using SageMaker Clarify.
- Mitigation: Resampling, shuffling, or synthetic data generation.
- Security: Data masking, anonymization, and encryption (KMS).
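The dropping-vs-imputing choice from section II can be sketched with pandas (the column names and values are illustrative):

```python
import pandas as pd

# Illustrative dataset with missing values
df = pd.DataFrame({'Age': [25.0, None, 35.0, 50.0],
                   'City': ['NY', 'LA', None, 'NY']})

# Option 1: drop any row containing a missing value (loses 2 of 4 rows here)
dropped = df.dropna()

# Option 2: impute -- mean for numeric columns, mode for categorical ones
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['City'] = df['City'].fillna(df['City'].mode()[0])
print(df)  # no missing values remain
```

Dropping is simplest but discards signal; imputation keeps every row at the cost of injecting assumed values, so the choice depends on how much data you can afford to lose.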
Visual Anchors
The Data Preparation Pipeline
(Original diagram not reproduced; the flow mirrors the outline above: Raw Data → Cleaning & Integrity → Feature Engineering → Bias & Compliance → Training-ready data.)
Feature Scaling Impact
This TikZ diagram visualizes how standardization centers data around a common scale, which is essential for algorithms like SVM or Linear Regression.
```latex
\begin{tikzpicture}
  % Axes
  \draw[->] (-1,0) -- (4,0) node[right] {Feature Value};
  \draw[->] (0,-1) -- (0,3) node[above] {Density};
  % Unscaled distribution (wide and shifted)
  \draw[thick, blue] (1,0) .. controls (2,2) and (3,2) .. (4,0.2);
  \node[blue] at (3.5, 1.5) {Raw Data (Skewed)};
  % Standardized distribution (centered at 0)
  \draw[thick, red] (-0.8,0) .. controls (0,2.5) and (0,2.5) .. (0.8,0);
  \node[red] at (1.2, 2.5) {Standardized (Mean=0)};
\end{tikzpicture}
```
Definition-Example Pairs
- Term: Synthetic Data Generation
  - Definition: Creating artificial data points that mimic the statistical properties of the real dataset.
  - Example: If a fraud detection dataset only has 1% "fraud" cases, you generate synthetic fraud examples to help the model learn the pattern without being overwhelmed by "non-fraud" cases.
- Term: Data Anonymization
  - Definition: The process of removing or modifying PII so that individuals cannot be identified.
  - Example: Replacing specific names in a medical dataset with generic IDs (e.g., "Patient_001") to comply with HIPAA regulations.
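A minimal pandas sketch of the anonymization example above (the column names and records are assumptions for illustration):

```python
import pandas as pd

# Hypothetical medical records containing PII in the 'name' column
records = pd.DataFrame({'name': ['Alice Smith', 'Bob Jones', 'Alice Smith'],
                        'diagnosis': ['A', 'B', 'C']})

# Map each unique name to a generic ID such as "Patient_001"
ids = {name: f'Patient_{i + 1:03d}'
       for i, name in enumerate(records['name'].unique())}
records['name'] = records['name'].map(ids)
print(records)  # same person keeps the same ID, but the PII is gone
```

Note that this is pseudonymization: keeping a stable ID per person preserves record linkage, which true anonymization would also remove.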
Worked Examples
1. Python Implementation: Standardization
Using scikit-learn to prepare a feature set (as referenced in the SageMaker workflow).
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample dataset: Age and Yearly Income
data = pd.DataFrame({'Age': [25, 45, 35, 50],
                     'Income': [50000, 120000, 80000, 150000]})

# Initialize scaler
scaler = StandardScaler()

# Fit and transform the data
standardized_data = scaler.fit_transform(data)
print(standardized_data)
# Output will be centered around 0 with unit variance
```

2. AWS Glue DataBrew: Deduplication
- Connect: Connect DataBrew to your S3 bucket.
- Transform: Select the "Remove Duplicates" transform from the visual menu.
- Scope: Choose to remove duplicates based on a specific unique key (e.g., `customer_id`) or the entire row.
- Job: Run the DataBrew job to output the cleaned CSV back to S3 for training.
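For comparison, the two deduplication scopes above can be sketched in pandas (`customer_id` mirrors the key in the DataBrew steps; the rows are made up):

```python
import pandas as pd

# Toy orders table; customer 101 appears twice with different spend
orders = pd.DataFrame({'customer_id': [101, 102, 101, 103],
                       'spend': [50, 75, 60, 20]})

# Entire-row scope: only rows identical in every column are duplicates
no_exact_dupes = orders.drop_duplicates()

# Key-based scope: keep the first row per customer_id
no_key_dupes = orders.drop_duplicates(subset=['customer_id'])

print(len(no_exact_dupes), len(no_key_dupes))  # 4 3
```

The two scopes can give different results, as here: no row is an exact duplicate, but customer 101 is a key-based duplicate.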
Checkpoint Questions
- Which AWS service provides a visual interface for data cleaning without requiring code?
- Answer: AWS Glue DataBrew.
- What is the primary risk of having duplicate entries in a training dataset?
- Answer: It introduces bias by over-representing specific entries, potentially leading to an imbalanced model.
- True or False: Feature engineering involves adding new external data to your dataset.
- Answer: False. It involves extracting more information from existing data.
- What metric would you use to measure the difference in positive labels between two groups?
- Answer: Difference in Proportions of Labels (DPL).
Muddy Points & Cross-Refs
- Standardization vs. Normalization:
- Standardization (Z-score) scales data based on the mean and standard deviation. Use this for algorithms that assume a Gaussian distribution (e.g., Linear Regression).
- Normalization (Min-Max) scales data to a fixed range, usually [0, 1]. Use this when the data has known bounds (e.g., image pixel values in [0, 255]).
- Bias Detection: While SageMaker Clarify is the primary tool for detecting bias, Glue Data Quality is better for technical integrity checks (e.g., "Is this column null?").
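The standardization-vs-normalization contrast can be sketched with scikit-learn on the same toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[25.0], [45.0], [35.0], [50.0]])  # e.g., Age

z = StandardScaler().fit_transform(x)  # mean 0, standard deviation 1
m = MinMaxScaler().fit_transform(x)    # rescaled into [0, 1]

print(m.ravel().tolist())  # [0.0, 0.8, 0.4, 1.0]
```

After `StandardScaler`, values fall on an unbounded z-score scale centered at 0; after `MinMaxScaler`, the smallest value maps to exactly 0 and the largest to exactly 1.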
Comparison Tables
Data Preparation Tools Comparison
| Tool | Primary User | Best For... | Coding Required? |
|---|---|---|---|
| SageMaker Notebooks | Data Scientists | Custom, complex transformations (Pandas/PySpark). | Yes (Python/R) |
| AWS Glue DataBrew | ML Engineers / Analysts | Fast, visual cleaning and deduplication. | No (Visual) |
| AWS Glue (ETL) | Data Engineers | Large-scale production data pipelines. | Yes (Spark/Python) |
Structured vs. Unstructured Data
| Feature | Structured | Unstructured |
|---|---|---|
| Schema | Rigid, Pre-defined | None / Internal only |
| Examples | SQL Tables, CSV | JPG, MP4, PDF, Text |
| ML Approach | Tabular Algorithms (XGBoost) | Deep Learning (CNNs, LLMs) |
| Searchability | Easy via queries | Difficult (requires indexing/embedding) |