Machine Learning Feasibility: Data Assessment and Problem Complexity
Assessing available data and problem complexity to determine the feasibility of an ML solution
This guide focuses on the critical first phase of the machine learning lifecycle: determining if a problem is suitable for ML and whether the existing data can support a viable solution. This is a core competency for the AWS Certified Machine Learning Engineer (Associate) exam.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between problems requiring deterministic algorithms and those requiring probabilistic ML models.
- Assess data quality and availability to determine if an ML model can be trained effectively.
- Evaluate problem complexity based on latency, scalability, and resource requirements.
- Establish performance baselines using simple models to justify complex ML implementations.
- Identify regulatory and ethical constraints (e.g., PII, PHI) that impact feasibility.
Key Terms & Glossary
- Deterministic: A system where the same input always produces the exact same output via explicit rules.
- Probabilistic: A system that relies on statistical patterns and likelihoods (standard for ML).
- GIGO (Garbage In, Garbage Out): The principle that the quality of output is determined by the quality of the input data.
- Target Variable (Label): The specific outcome or value the model is trying to predict.
- Latency: The time taken for a model to provide a prediction after receiving input.
- Data Residency: The physical or geographic location where data is stored, often dictated by law.
The "Big Idea"
Not every business problem requires Machine Learning. Traditional programming uses Rules + Data → Answers. Machine Learning flips this: Answers + Data → Rules. Feasibility assessment is the process of proving that (1) a pattern actually exists in the data, (2) you have enough high-quality data to find it, and (3) the cost of finding it is lower than the business value it provides.
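The "flip" described above can be sketched in a few lines of Python. The fraud-flagging rule and the toy "learning" routine below are hypothetical illustrations (the thresholds and the midpoint heuristic are made up for this sketch, not a real ML algorithm):

```python
# Traditional programming: Rules + Data -> Answers.
# A human writes the rule explicitly (the 10,000 threshold is invented).
def rule_based_flag(amount):
    return amount > 10_000  # deterministic: same input, same output

# Machine learning (toy version): Answers + Data -> Rules.
# Given labeled examples, "learn" a threshold instead of hard-coding it:
# here, the midpoint between the largest legitimate amount and the
# smallest fraudulent one.
def learn_threshold(amounts, labels):
    largest_legit = max(a for a, y in zip(amounts, labels) if y == 0)
    smallest_fraud = min(a for a, y in zip(amounts, labels) if y == 1)
    return (largest_legit + smallest_fraud) / 2

# The "rule" now comes from the labeled data, not from a programmer.
threshold = learn_threshold([120, 300, 9_800, 15_000], [0, 0, 1, 1])
```

A real model replaces the midpoint heuristic with statistical fitting, but the direction of the arrow is the same: labeled answers plus data produce the rule.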
Formula / Concept Box
| Concept | Description / Formula |
|---|---|
| Success Metric | Must be quantifiable (e.g., "Reduce churn by 10%" not "Improve customer happiness"). |
| Data Split Ratio | Standard starting point: 70% Training / 15% Validation / 15% Testing. |
| Bias Metric (CI) | Class Imbalance: CI = (n_a − n_d) / (n_a + n_d), where n_a and n_d are the counts of the two classes. CI near 0 is balanced; near ±1, one class dominates the dataset. |
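Both table entries can be made concrete with a short sketch. The helper names are illustrative; the CI function follows the standard pre-training class-imbalance definition for a binary label:

```python
import random

def split_70_15_15(rows, seed=0):
    """Shuffle and split into the standard 70/15/15 train/val/test sets."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    i, j = int(len(rows) * 0.70), int(len(rows) * 0.85)
    return rows[:i], rows[i:j], rows[j:]

def class_imbalance(labels):
    """CI = (n_a - n_d) / (n_a + n_d) for a binary label column.
    0 means perfectly balanced; +/-1 means one class is absent."""
    n_a = sum(1 for y in labels if y == 1)
    n_d = len(labels) - n_a
    return (n_a - n_d) / (n_a + n_d)
```

For example, `class_imbalance([1, 1, 1, 0])` yields 0.5, signalling that the positive class dominates 3-to-1.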
Hierarchical Outline
- Problem Definition & Framing
- Business Goal: Identify the specific opportunity (e.g., Fraud Detection).
- ML Framing: Translate goal into a technical task (e.g., Binary Classification).
- Data Feasibility Assessment
- Availability: Do we have the data? Is it accessible in AWS (S3, RDS)?
- Quality: Check for missing values, outliers, and noise.
- Integrity: Ensure representative sampling to avoid selection bias.
- Complexity & Constraints
- Inference Requirements: Real-time (low latency) vs. Batch processing.
- Resources: CPU/GPU availability and budget for training.
- Regulatory: Handling PII/PHI and interpretability needs.
- Baseline Establishment
- Start with Simple Models (Linear/Logistic Regression).
- Compare complex models against this baseline to measure ROI.
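One way to put the baseline idea from the outline into code. This sketch uses a majority-class baseline and an assumed 5% minimum lift; both choices are illustrative, not a prescribed methodology:

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class.
    Any ML model must clear this floor to justify its added cost."""
    _, count = Counter(y_true).most_common(1)[0]
    return count / len(y_true)

def worth_the_complexity(model_accuracy, y_true, min_lift=0.05):
    """Hypothetical ROI gate: adopt the complex model only if it beats
    the baseline by at least `min_lift` (an assumed margin)."""
    return model_accuracy >= majority_baseline_accuracy(y_true) + min_lift
```

On a dataset where 75% of customers do not churn, a model scoring 76% accuracy barely beats guessing and would fail this gate.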
Visual Anchors
ML Feasibility Decision Flow
Data Value vs. Complexity
```latex
\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Data Complexity};
  \draw[->] (0,0) -- (0,6) node[above] {Model Value};
  \draw[thick, blue] (0,1) .. controls (2,1.5) and (4,4) .. (5.5,5.5);
  \node at (3,2) [rotate=35] {ML Advantage};
  \draw[dashed] (0,1) -- (5.5,1) node[right] {Baseline (Simple Stats)};
\end{tikzpicture}
```
Definition-Example Pairs
- Feature Engineering: The process of transforming raw data into formats that better represent the underlying problem.
- Example: Converting a "Timestamp" into "Day of the Week" to help a model predict weekend sales spikes.
- Interpretability: The degree to which a human can understand the cause of a decision.
- Example: A bank using a Decision Tree for loan approvals because they must explain to customers why a loan was denied.
- Scalability: The ability to handle increasing volumes of data without a performance drop.
- Example: Using Amazon SageMaker Linear Learner because it can scale to multi-terabyte datasets more efficiently than a local Python script.
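The feature-engineering pair above (timestamp to day of week) can be sketched with the standard library; the function name is illustrative:

```python
from datetime import datetime

def add_day_of_week(iso_timestamps):
    """Turn raw ISO-8601 timestamps into weekday-name features,
    so a model can pick up weekend sales spikes directly."""
    return [datetime.fromisoformat(ts).strftime("%A") for ts in iso_timestamps]
```

A raw value like `"2024-01-06T09:30:00"` becomes `"Saturday"`, a categorical feature far more useful for predicting weekend behavior than the original timestamp.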
Worked Examples
Case Study: Coffee Shop Churn Prediction
1. Business Problem: A coffee shop wants to prevent customers from leaving for competitors.
2. Framing: This is a Binary Classification problem. Prediction: Will the customer return in the next 30 days? (Yes/No).
3. Data Assessment:
   - Inputs: Transaction history (frequency, spend), loyalty app logs, time since last visit.
   - Feasibility Check: If the shop only has "Total Daily Revenue" but no customer IDs, ML is not feasible because there is no way to link behavior to individuals.
4. Baseline: Use a simple rule: "If a customer hasn't visited in 14 days, they have churned." If a Random Forest model can't beat this simple logic, the ML solution is not worth the cost.
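The 14-day baseline rule from the case study can be written down directly. The comparison helper and its 2% margin are illustrative assumptions, not part of the case study:

```python
def rule_baseline(days_since_last_visit, threshold=14):
    """Case-study baseline: a customer counts as 'churned' if they
    have not visited within `threshold` days."""
    return [d > threshold for d in days_since_last_visit]

def ml_is_justified(model_accuracy, baseline_accuracy, margin=0.02):
    """Adopt the Random Forest only if it beats the rule by `margin`
    (the 2% margin is an assumed cut-off for this sketch)."""
    return model_accuracy >= baseline_accuracy + margin
```

If the rule already identifies churners with 85% accuracy, a Random Forest scoring 86% likely does not repay its training and hosting costs.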
Checkpoint Questions
- What is the main difference between a deterministic and a probabilistic approach?
- Why should you start with a simple model (like Linear Regression) before moving to Deep Learning?
- What AWS tool would you use to identify pre-training bias such as class imbalance?
- If your application requires results in under 50ms, what constraint are you assessing?
Answers
- Deterministic uses fixed rules; Probabilistic uses statistical patterns/likelihoods.
- To establish a performance baseline and determine if added complexity provides enough ROI.
- SageMaker Clarify.
- Latency (Real-time inference feasibility).
Muddy Points & Cross-Refs
- AI Services vs. Custom ML: You don't always need to build a model. If the task is "Extract text from images," it is more feasible to use Amazon Rekognition (AI Service) than to train a custom CNN.
- Data Residency: Even if ML is technically feasible, legal requirements (like GDPR) might prevent you from moving data to a specific AWS region for training.
- Synthetic Data: If you lack enough data, you can use synthetic data generation, but use it with caution as it may not capture real-world noise accurately.
Comparison Tables
Traditional Programming vs. Machine Learning
| Feature | Traditional Programming | Machine Learning |
|---|---|---|
| Logic Source | Human-written rules | Data-driven patterns |
| Best For | Calculations, fixed workflows | Predictions, Natural Language, Vision |
| Adaptability | Hard-coded; requires manual updates | Improves as it is retrained on new data |
| Complexity | Linear | Often non-linear and high |
Data Formats for Ingestion
| Format | Best For | AWS Tool Advantage |
|---|---|---|
| Parquet | Large scale, columnar access | Efficient for S3 and Glue Crawler |
| CSV | Small datasets, human readability | Easy to inspect in DataBrew |
| JSON | Semi-structured data | Native for many NoSQL/App sources |
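The CSV and JSON rows of the table can be illustrated with the standard library alone (Parquet requires an extra library such as pyarrow, so it is only noted here). The sample records are invented:

```python
import csv
import io
import json

records = [{"id": 1, "spend": 4.5}, {"id": 2, "spend": 12.0}]

# CSV: flat and human-readable -- easy to eyeball in a tool like DataBrew.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "spend"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON Lines: semi-structured, one object per line -- a common shape
# for application logs and NoSQL exports.
jsonl_text = "\n".join(json.dumps(r) for r in records)
```

CSV loses type information (everything round-trips as strings), while JSON preserves numbers and nesting; columnar Parquet adds compression and column pruning on top, which is what makes it the default for large S3 datasets.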