Machine Learning Feasibility: Data Assessment and Problem Complexity
Assessing available data and problem complexity to determine the feasibility of an ML solution
This guide focuses on the critical first phase of the machine learning lifecycle: determining if a problem is suitable for ML and whether the existing data can support a viable solution. This is a core competency for the AWS Certified Machine Learning Engineer (Associate) exam.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between problems requiring deterministic algorithms and those requiring probabilistic ML models.
- Assess data quality and availability to determine if an ML model can be trained effectively.
- Evaluate problem complexity based on latency, scalability, and resource requirements.
- Establish performance baselines using simple models to justify complex ML implementations.
- Identify regulatory and ethical constraints (e.g., PII, PHI) that impact feasibility.
Key Terms & Glossary
- Deterministic: A system where the same input always produces the exact same output via explicit rules.
- Probabilistic: A system that relies on statistical patterns and likelihoods (standard for ML).
- GIGO (Garbage In, Garbage Out): The principle that the quality of output is determined by the quality of the input data.
- Target Variable (Label): The specific outcome or value the model is trying to predict.
- Latency: The time taken for a model to provide a prediction after receiving input.
- Data Residency: The physical or geographic location where data is stored, often dictated by law.
The "Big Idea"
Not every business problem requires Machine Learning. Traditional programming uses Rules + Data → Answers. Machine Learning flips this: Answers + Data → Rules. Feasibility assessment is the process of proving that (1) a pattern actually exists in the data, (2) you have enough high-quality data to find it, and (3) the cost of finding it is lower than the business value it provides.
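The "flip" described above can be sketched in a few lines of Python. The fraud-flagging rule and the toy "learning" routine below are hypothetical illustrations (the thresholds and the midpoint heuristic are made up for this sketch, not a real ML algorithm):

```python
# Traditional programming: Rules + Data -> Answers.
# A human writes the rule explicitly (the 10,000 threshold is invented).
def rule_based_flag(amount):
    return amount > 10_000  # deterministic: same input, same output

# Machine learning (toy version): Answers + Data -> Rules.
# Given labeled examples, "learn" a threshold instead of hard-coding it:
# here, the midpoint between the largest legitimate amount and the
# smallest fraudulent one.
def learn_threshold(amounts, labels):
    largest_legit = max(a for a, y in zip(amounts, labels) if y == 0)
    smallest_fraud = min(a for a, y in zip(amounts, labels) if y == 1)
    return (largest_legit + smallest_fraud) / 2

# The "rule" now comes from the labeled data, not from a programmer.
threshold = learn_threshold([120, 300, 9_800, 15_000], [0, 0, 1, 1])
```

A real model replaces the midpoint heuristic with statistical fitting, but the direction of the arrow is the same: labeled answers plus data produce the rule.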
Formula / Concept Box
| Concept | Description / Formula |
|---|---|
| Success Metric | Must be quantifiable (e.g., "Reduce churn by 10%" not "Improve customer happiness"). |
| Data Split Ratio | Standard starting point: 70% Training / 15% Validation / 15% Testing. |
| Bias Metric (CI) | Class Imbalance: CI = (n_a − n_d) / (n_a + n_d), where n_a and n_d are the counts of the two classes. CI near 0 is balanced; near ±1, one class dominates the dataset. |
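Both table entries can be made concrete with a short sketch. The helper names are illustrative; the CI function follows the standard pre-training class-imbalance definition for a binary label:

```python
import random

def split_70_15_15(rows, seed=0):
    """Shuffle and split into the standard 70/15/15 train/val/test sets."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    i, j = int(len(rows) * 0.70), int(len(rows) * 0.85)
    return rows[:i], rows[i:j], rows[j:]

def class_imbalance(labels):
    """CI = (n_a - n_d) / (n_a + n_d) for a binary label column.
    0 means perfectly balanced; +/-1 means one class is absent."""
    n_a = sum(1 for y in labels if y == 1)
    n_d = len(labels) - n_a
    return (n_a - n_d) / (n_a + n_d)
```

For example, `class_imbalance([1, 1, 1, 0])` yields 0.5, signalling that the positive class dominates 3-to-1.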
Hierarchical Outline
- Problem Definition & Framing
- Business Goal: Identify the specific opportunity (e.g., Fraud Detection).
- ML Framing: Translate goal into a technical task (e.g., Binary Classification).
- Data Feasibility Assessment
- Availability: Do we have the data? Is it accessible in AWS (S3, RDS)?
- Quality: Check for missing values, outliers, and noise.
- Integrity: Ensure representative sampling to avoid selection bias.
- Complexity & Constraints
- Inference Requirements: Real-time (low latency) vs. Batch processing.
- Resources: CPU/GPU availability and budget for training.
- Regulatory: Handling PII/PHI and interpretability needs.
- Baseline Establishment
- Start with Simple Models (Linear/Logistic Regression).
- Compare complex models against this baseline to measure ROI.
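One way to put the baseline idea from the outline into code. This sketch uses a majority-class baseline and an assumed 5% minimum lift; both choices are illustrative, not a prescribed methodology:

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class.
    Any ML model must clear this floor to justify its added cost."""
    _, count = Counter(y_true).most_common(1)[0]
    return count / len(y_true)

def worth_the_complexity(model_accuracy, y_true, min_lift=0.05):
    """Hypothetical ROI gate: adopt the complex model only if it beats
    the baseline by at least `min_lift` (an assumed margin)."""
    return model_accuracy >= majority_baseline_accuracy(y_true) + min_lift
```

On a dataset where 75% of customers do not churn, a model scoring 76% accuracy barely beats guessing and would fail this gate.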
Visual Anchors
ML Feasibility Decision Flow
Data Value vs. Complexity
```latex
\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Data Complexity};
  \draw[->] (0,0) -- (0,6) node[above] {Model Value};
  \draw[thick, blue] (0,1) .. controls (2,1.5) and (4,4) .. (5.5,5.5);
  \node at (3,2) [rotate=35] {ML Advantage};
  \draw[dashed] (0,1) -- (5.5,1) node[right] {Baseline (Simple Stats)};
\end{tikzpicture}
```
Definition-Example Pairs
- Feature Engineering: The process of transforming raw data into formats that better represent the underlying problem.
- Example: Converting a "Timestamp" into "Day of the Week" to help a model predict weekend sales spikes.
- Interpretability: The degree to which a human can understand the cause of a decision.
- Example: A bank using a Decision Tree for loan approvals because they must explain to customers why a loan was denied.
- Scalability: The ability to handle increasing volumes of data without a performance drop.
- Example: Using Amazon SageMaker Linear Learner because it can scale to multi-terabyte datasets more efficiently than a local Python script.
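The feature-engineering pair above (timestamp to day of week) can be sketched with the standard library; the function name is illustrative:

```python
from datetime import datetime

def add_day_of_week(iso_timestamps):
    """Turn raw ISO-8601 timestamps into weekday-name features,
    so a model can pick up weekend sales spikes directly."""
    return [datetime.fromisoformat(ts).strftime("%A") for ts in iso_timestamps]
```

A raw value like `"2024-01-06T09:30:00"` becomes `"Saturday"`, a categorical feature far more useful for predicting weekend behavior than the original timestamp.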
Worked Examples
Case Study: Coffee Shop Churn Prediction
1. Business Problem: A coffee shop wants to prevent customers from leaving for competitors.
2. Framing: This is a Binary Classification problem. Prediction: Will the customer return in the next 30 days? (Yes/No).
3. Data Assessment:
   - Inputs: Transaction history (frequency, spend), loyalty app logs, time since last visit.
   - Feasibility Check: If the shop only has "Total Daily Revenue" but no customer IDs, ML is not feasible because there is no way to link behavior to individuals.
4. Baseline: Use a simple rule: "If a customer hasn't visited in 14 days, they have churned." If a Random Forest model can't beat this simple logic, the ML solution is not worth the cost.
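The 14-day baseline rule from the case study can be written down directly. The comparison helper and its 2% margin are illustrative assumptions, not part of the case study:

```python
def rule_baseline(days_since_last_visit, threshold=14):
    """Case-study baseline: a customer counts as 'churned' if they
    have not visited within `threshold` days."""
    return [d > threshold for d in days_since_last_visit]

def ml_is_justified(model_accuracy, baseline_accuracy, margin=0.02):
    """Adopt the Random Forest only if it beats the rule by `margin`
    (the 2% margin is an assumed cut-off for this sketch)."""
    return model_accuracy >= baseline_accuracy + margin
```

If the rule already identifies churners with 85% accuracy, a Random Forest scoring 86% likely does not repay its training and hosting costs.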
Checkpoint Questions
- What is the main difference between a deterministic and a probabilistic approach?
- Why should you start with a simple model (like Linear Regression) before moving to Deep Learning?
- What AWS tool would you use to identify pre-training bias such as class imbalance?
- If your application requires results in under 50ms, what constraint are you assessing?
Answers
- Deterministic uses fixed rules; Probabilistic uses statistical patterns/likelihoods.
- To establish a performance baseline and determine if added complexity provides enough ROI.
- SageMaker Clarify.
- Latency (Real-time inference feasibility).
Muddy Points & Cross-Refs
- AI Services vs. Custom ML: You don't always need to build a model. If the task is "Extract text from images," it is more feasible to use Amazon Rekognition (AI Service) than to train a custom CNN.
- Data Residency: Even if ML is technically feasible, legal requirements (like GDPR) might prevent you from moving data to a specific AWS region for training.
- Synthetic Data: If you lack enough data, you can use synthetic data generation, but use it with caution as it may not capture real-world noise accurately.
Comparison Tables
Traditional Programming vs. Machine Learning
| Feature | Traditional Programming | Machine Learning |
|---|---|---|
| Logic Source | Human-written rules | Data-driven patterns |
| Best For | Calculations, fixed workflows | Predictions, Natural Language, Vision |
| Adaptability | Hard-coded; requires manual updates | Improves as it is retrained on new data |
| Complexity | Linear | Often non-linear and high |
Data Formats for Ingestion
| Format | Best For | AWS Tool Advantage |
|---|---|---|
| Parquet | Large scale, columnar access | Efficient for S3 and Glue Crawler |
| CSV | Small datasets, human readability | Easy to inspect in DataBrew |
| JSON | Semi-structured data | Native for many NoSQL/App sources |
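The CSV and JSON rows of the table can be illustrated with the standard library alone (Parquet requires an extra library such as pyarrow, so it is only noted here). The sample records are invented:

```python
import csv
import io
import json

records = [{"id": 1, "spend": 4.5}, {"id": 2, "spend": 12.0}]

# CSV: flat and human-readable -- easy to eyeball in a tool like DataBrew.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "spend"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON Lines: semi-structured, one object per line -- a common shape
# for application logs and NoSQL exports.
jsonl_text = "\n".join(json.dumps(r) for r in records)
```

CSV loses type information (everything round-trips as strings), while JSON preserves numbers and nesting; columnar Parquet adds compression and column pruning on top, which is what makes it the default for large S3 datasets.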