Unit 1 Study Guide: Data Preparation for Machine Learning (ML)

This guide covers the foundational processes of preparing, cleaning, and transforming data for machine learning, specifically focusing on the AWS ecosystem as part of the ML Engineer Associate curriculum.

Learning Objectives

After studying this guide, you should be able to:

  • Identify and categorize structured, semi-structured, and unstructured data formats.
  • Explain the importance of data deduplication and standardization in preventing model bias.
  • Select appropriate AWS services (e.g., Glue DataBrew, SageMaker Clarify) for specific data preparation tasks.
  • Apply feature engineering techniques to enhance model predictive power.
  • Calculate and interpret pre-training bias metrics like Class Imbalance (CI).

Key Terms & Glossary

  • Feature: An individual measurable property or characteristic of the data (e.g., "pixel value" or "average spend").
  • Imputation: The technique of replacing missing data with substituted values (mean, median, or mode).
  • Deduplication: The process of identifying and removing repeated entries to prevent training bias.
  • Class Imbalance (CI): A state where certain labels in a dataset are represented more frequently than others, leading to skewed model results.
  • Standardization: Rescaling features so they have a mean of 0 and a standard deviation of 1.
  • PII (Personally Identifiable Information): Sensitive data that can identify an individual, requiring masking or anonymization for compliance.
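Several of these glossary terms map directly to one-line pandas operations. A minimal sketch of mean imputation and deduplication, using a toy DataFrame that is illustrative only (not from the course material):

```python
import pandas as pd
import numpy as np

# Toy dataset with one missing value and one exact duplicate row
df = pd.DataFrame({'age': [25, np.nan, 25, 40],
                   'spend': [10.0, 20.0, 10.0, 30.0]})

# Imputation: replace the missing 'age' with the column mean (here, 30)
df['age'] = df['age'].fillna(df['age'].mean())

# Deduplication: drop exact repeated rows to avoid over-representing them
df = df.drop_duplicates()

print(df)
```

Note that imputation statistics (mean/median/mode) should be computed on the training split only, then reused for validation and test data.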

The "Big Idea"

[!IMPORTANT] Garbage In, Garbage Out (GIGO): The quality of a Machine Learning model is capped by the quality of the data used to train it. Data preparation is not just a "cleanup" phase; it is the strategic process of turning raw, noisy information into a high-signal input that allows algorithms to discover meaningful patterns.

Formula / Concept Box

| Concept | Definition / Formula | Use Case |
|---|---|---|
| Standardization (Z-score) | $z = \frac{x - \mu}{\sigma}$ | When features have different scales (e.g., Age vs. Income). |
| Class Imbalance (CI) | $CI = \frac{n_a - n_b}{n_a + n_b}$ | Measuring the gap between the majority and minority class. |
| DPL | Difference in Proportions of Labels | Comparing positive outcomes across different facets (e.g., gender). |
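As a quick worked check of the CI formula (the class counts below are hypothetical):

```python
# Class Imbalance: CI = (n_a - n_b) / (n_a + n_b),
# where n_a is the majority-class count and n_b the minority-class count.
# CI near 0 means balanced; CI near 1 means severe imbalance.
def class_imbalance(n_a: int, n_b: int) -> float:
    return (n_a - n_b) / (n_a + n_b)

# Example: 990 "non-fraud" records vs. 10 "fraud" records
print(class_imbalance(990, 10))  # 0.98 -- severe imbalance
print(class_imbalance(500, 500))  # 0.0 -- perfectly balanced
```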

Hierarchical Outline

  • I. Data Types and Formats
    • Structured Data: Predictable schema (CSV, Parquet, Databases).
    • Semi-structured Data: Self-describing, flexible schema (JSON, XML).
    • Unstructured Data: No predefined schema (Images, Audio, Video, Text).
  • II. Data Cleaning and Integrity
    • Deduplication: Using AWS Glue DataBrew for visual cleanup.
    • Handling Missingness: Dropping vs. Imputing (Mean/Median/Mode).
    • Data Quality: Validating integrity via AWS Glue Data Quality.
  • III. Feature Engineering and Transformation
    • Reformatting: Restructuring data types and harmonizing formats.
    • Scaling: Using StandardScaler to prevent large values from dominating the model.
    • New Feature Creation: Deriving insights (e.g., "Avg visits per month" from "Visit history").
  • IV. Bias and Compliance
    • Pre-training Bias: Detected using SageMaker Clarify.
    • Mitigation: Resampling, shuffling, or synthetic data generation.
    • Security: Data masking, anonymization, and encryption (KMS).
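The "Mitigation: Resampling" step above can be sketched as simple random oversampling of the minority class in pandas. This is a minimal sketch with illustrative column names and values; in practice you might use a dedicated technique such as SMOTE instead:

```python
import pandas as pd

# Toy labelled dataset: 6 majority ("ok") rows vs. 2 minority ("fraud") rows
df = pd.DataFrame({'amount': [10, 12, 11, 9, 13, 10, 500, 620],
                   'label': ['ok'] * 6 + ['fraud'] * 2})

minority = df[df['label'] == 'fraud']
majority = df[df['label'] == 'ok']

# Resample the minority class (with replacement) up to the majority size,
# then shuffle the combined result
oversampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, oversampled]).sample(frac=1, random_state=0)

print(balanced['label'].value_counts())
```

Oversampling is applied only to the training split; evaluating on resampled data would inflate metrics.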

Visual Anchors

The Data Preparation Pipeline

Raw data (S3) → Clean (deduplicate, impute) → Transform (scale, engineer features) → Audit (bias, PII) → Training-ready dataset.

Feature Scaling Impact

This TikZ diagram visualizes how standardization centers data around a common scale, which is essential for algorithms like SVM or Linear Regression.

```latex
\begin{tikzpicture}
  % Axes
  \draw[->] (-1,0) -- (4,0) node[right] {Feature Value};
  \draw[->] (0,-1) -- (0,3) node[above] {Density};

  % Unscaled distribution (wide and shifted)
  \draw[thick, blue] (1,0) .. controls (2,2) and (3,2) .. (4,0.2);
  \node[blue] at (3.5, 1.5) {Raw Data (Skewed)};

  % Standardized distribution (centered at 0)
  \draw[thick, red] (-0.8,0) .. controls (0,2.5) and (0,2.5) .. (0.8,0);
  \node[red] at (1.2, 2.5) {Standardized (Mean=0)};
\end{tikzpicture}
```

Definition-Example Pairs

  • Term: Synthetic Data Generation

    • Definition: Creating artificial data points that mimic the statistical properties of the real dataset.
    • Example: If a fraud detection dataset only has 1% "fraud" cases, you generate synthetic fraud examples to help the model learn the pattern without being overwhelmed by "non-fraud" cases.
  • Term: Data Anonymization

    • Definition: The process of removing or modifying PII so that individuals cannot be identified.
    • Example: Replacing specific names in a medical dataset with generic IDs (e.g., "Patient_001") to comply with HIPAA regulations.
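The "Patient_001" example above can be reproduced in a few lines of pandas. The records below are fabricated for illustration:

```python
import pandas as pd

# Toy medical records containing PII (all names are fabricated)
df = pd.DataFrame({'name': ['Ana Ruiz', 'Bob Lee', 'Cara Osei'],
                   'diagnosis': ['flu', 'asthma', 'flu']})

# Anonymization: replace each name with a generic patient ID,
# then drop the PII column entirely
df['patient_id'] = ['Patient_%03d' % (i + 1) for i in range(len(df))]
df = df.drop(columns=['name'])

print(df)
```

If re-identification must remain possible for authorized users, the name-to-ID mapping would be stored separately under encryption (pseudonymization) rather than discarded.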

Worked Examples

1. Python Implementation: Standardization

Using scikit-learn to prepare a feature set (as referenced in the SageMaker workflow).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample dataset: Age and Yearly Income
data = pd.DataFrame({'Age': [25, 45, 35, 50],
                     'Income': [50000, 120000, 80000, 150000]})

# Initialize scaler
scaler = StandardScaler()

# Fit and transform the data
standardized_data = scaler.fit_transform(data)
print(standardized_data)  # Output will be centered around 0 with unit variance
```

2. AWS Glue DataBrew: Deduplication

  1. Connect: Connect DataBrew to your S3 bucket.
  2. Transform: Select the "Remove Duplicates" transform from the visual menu.
  3. Scope: Choose to remove duplicates based on a specific unique key (e.g., customer_id) or the entire row.
  4. Job: Run the DataBrew job to output the cleaned CSV back to S3 for training.
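If you want to sanity-check the key-based scoping locally before running the DataBrew job, the same transform can be approximated with pandas (column names below are illustrative):

```python
import pandas as pd

# Illustrative customer export with one repeated customer_id
df = pd.DataFrame({'customer_id': [101, 102, 101, 103],
                   'spend': [50, 75, 50, 20]})

# Mirror DataBrew's "Remove Duplicates" scoped to a specific unique key:
# keep the first occurrence of each customer_id
deduped = df.drop_duplicates(subset='customer_id', keep='first')

print(len(deduped))  # 3 unique customers remain
```

Dropping `subset` reproduces the whole-row scope from step 3.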

Checkpoint Questions

  1. Which AWS service provides a visual interface for data cleaning without requiring code?
    • Answer: AWS Glue DataBrew.
  2. What is the primary risk of having duplicate entries in a training dataset?
    • Answer: It introduces bias by over-representing specific entries, potentially leading to an imbalanced model.
  3. True or False: Feature engineering involves adding new external data to your dataset.
    • Answer: False. It involves extracting more information from existing data.
  4. What metric would you use to measure the difference in positive labels between two groups?
    • Answer: Difference in Proportions of Labels (DPL).

Muddy Points & Cross-Refs

  • Standardization vs. Normalization:
    • Standardization (Z-score) scales data based on the mean and standard deviation. Use this for algorithms that assume a Gaussian distribution (e.g., Linear Regression).
    • Normalization (Min-Max) scales data to a fixed range, usually [0, 1]. Use this when you have specific boundaries or image pixel values.
  • Bias Detection: While SageMaker Clarify is the primary tool for detecting bias, Glue Data Quality is better for technical integrity checks (e.g., "Is this column null?").
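A minimal side-by-side of the two scaling approaches, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[25.0], [45.0], [35.0], [50.0]])  # e.g., an "Age" column

z = StandardScaler().fit_transform(x)  # Z-score: mean 0, std 1
m = MinMaxScaler().fit_transform(x)    # Min-Max: rescaled to [0, 1]

print(z.mean(), z.std())  # mean ~0, std 1
print(m.min(), m.max())   # 0.0 and 1.0 exactly
```

Note that Min-Max scaling maps the observed minimum and maximum to the range endpoints, so a single outlier compresses every other value; Z-score standardization is less sensitive to this.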

Comparison Tables

Data Preparation Tools Comparison

| Tool | Primary User | Best For... | Coding Required? |
|---|---|---|---|
| SageMaker Notebooks | Data Scientists | Custom, complex transformations (Pandas/PySpark). | Yes (Python/R) |
| AWS Glue DataBrew | ML Engineers / Analysts | Fast, visual cleaning and deduplication. | No (Visual) |
| AWS Glue (ETL) | Data Engineers | Large-scale production data pipelines. | Yes (Spark/Python) |

Structured vs. Unstructured Data

| Feature | Structured | Unstructured |
|---|---|---|
| Schema | Rigid, pre-defined | None / internal only |
| Examples | SQL tables, CSV | JPG, MP4, PDF, text |
| ML Approach | Tabular algorithms (XGBoost) | Deep learning (CNNs, LLMs) |
| Searchability | Easy via queries | Difficult (requires indexing/embedding) |
