Troubleshooting Data Ingestion and Storage: Capacity & Scalability
Troubleshooting and debugging data ingestion and storage issues that involve capacity and scalability
This guide covers the critical skills required to identify, debug, and resolve performance bottlenecks in the collection phase of the Machine Learning lifecycle, specifically within the AWS ecosystem.
Learning Objectives
By the end of this study guide, you will be able to:
- Identify common capacity bottlenecks in Kinesis, S3, and DynamoDB.
- Differentiate between real-time and batch ingestion troubleshooting strategies.
- Select appropriate monitoring tools (CloudWatch, CloudTrail) for specific failure modes.
- Optimize storage cost and performance using lifecycle policies and volume right-sizing.
Key Terms & Glossary
- Throughput: The amount of data moved from one place to another in a given time period.
- Sharding: A method of splitting a data stream (like Kinesis) into multiple segments to increase parallel processing capacity.
- Partitioning: Organizing data (e.g., in S3 or Glue) into hierarchical folders to optimize query performance and reduce scan volume.
- Service Quotas: Limits AWS imposes on resource usage per account, typically per Region (e.g., the number of running EC2 instances or S3 buckets).
- Backpressure: A phenomenon where a downstream system cannot keep up with the rate of incoming data, causing a bottleneck upstream.
The "Big Idea"
In the Machine Learning lifecycle, data ingestion and storage are the foundation. If these systems fail to scale or run out of capacity, downstream training and inference stop entirely. Scalability is the system's ability to handle growing amounts of work, while Capacity refers to the maximum amount that something can contain or produce. Troubleshooting involves finding where the "pipe" is too narrow for the "flow."
Formula / Concept Box
| Concept | Rule / Formula | Application |
|---|---|---|
| Kinesis Shard Capacity | 1 MB/sec (or 1,000 records/sec) ingress, 2 MB/sec egress per shard | Used to calculate the number of shards needed based on total throughput. |
| S3 Partitioning | s3://bucket/year=YYYY/month=MM/ | Drastically reduces data scanned by tools like Athena or Glue. |
| DynamoDB RCU/WCU | 1 WCU = one write of up to 1 KB/sec; 1 RCU = one strongly consistent read of up to 4 KB/sec | Calculating throughput costs for metadata indexing. |
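The shard and WCU rows translate directly into capacity arithmetic. A minimal sketch applying those formulas (function names and sample figures are illustrative):

```python
import math

def required_shards(ingress_mb_per_sec: float, egress_mb_per_sec: float) -> int:
    """Shards needed given 1 MB/sec ingress and 2 MB/sec egress per shard."""
    by_ingress = math.ceil(ingress_mb_per_sec / 1.0)
    by_egress = math.ceil(egress_mb_per_sec / 2.0)
    return max(by_ingress, by_egress, 1)

def required_wcus(writes_per_sec: int, item_size_kb: float) -> int:
    """WCUs needed: 1 WCU covers one write of up to 1 KB per second,
    so item size rounds up to the next whole KB."""
    return writes_per_sec * math.ceil(item_size_kb)

print(required_shards(4.5, 6.0))  # 5 (ingress-bound: 5 > 3)
print(required_wcus(100, 2.5))    # 300 (2.5 KB rounds up to 3 WCUs per write)
```

Note that the larger of the ingress- and egress-derived counts wins: a stream can be egress-bound even when writes fit comfortably in one shard.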
Hierarchical Outline
- I. Monitoring & Diagnostics
- CloudWatch Metrics: Monitoring CPU, Memory, I/O, and Throttle events.
- CloudTrail: Auditing API calls to find "Access Denied" or "Rate Exceeded" errors.
- II. Data Ingestion Troubleshooting
- Real-time (Kinesis/MSK): Resolving shard-level throttling and consumer lag.
- Batch (Glue/DataSync): Optimizing ETL job concurrency and partition logic.
- III. Storage Scalability
- Object (S3): Implementing lifecycle policies and VPC endpoints for throughput.
- Block (EBS): Right-sizing IOPS and choosing between gp3 and io2 volumes.
- NoSQL (DynamoDB): Avoiding "Hot Keys" and minimizing expensive Scan operations.
Visual Anchors
Troubleshooting Flowchart
Scalability Concept: Sharding vs. Capacity
\begin{tikzpicture}[node distance=1cm, auto]
  \draw[thick, ->] (0,0) -- (6,0) node[anchor=north] {Time/Scale};
  \draw[thick, ->] (0,0) -- (0,4) node[anchor=east] {Throughput};
  % Shard 1
  \draw[fill=blue!20] (0.5,0.5) rectangle (5,1.5);
  \node at (2.75, 1) {Shard 1 (Fixed Capacity)};
  % Shard 2 added
  \draw[fill=green!20] (0.5,1.6) rectangle (5,2.6);
  \node at (2.75, 2.1) {Shard 2 (Horizontal Scaling)};
  % Bottleneck line
  \draw[red, dashed, thick] (0, 3) -- (5.5, 3) node[right] {\small Bottleneck};
\end{tikzpicture}
Definition-Example Pairs
- Lifecycle Policy: Rules that automatically transition or delete data after a set period.
- Example: Moving raw training logs from S3 Standard to S3 Glacier after 30 days to save costs.
- Read Replicas: Copies of a database that handle read-only traffic.
- Example: Creating an Aurora Read Replica to handle a sudden surge in data exploration queries without slowing down the primary write engine.
- Hot Key: A partition key that receives a disproportionate amount of traffic.
- Example: In a Kinesis stream, if all sensor data for "Device_A" goes to one shard while other shards are empty, Shard 1 will throttle.
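The lifecycle-policy example above can be written out as the rule document S3 expects. A minimal sketch, with an illustrative rule ID and prefix (in practice you would pass this dictionary to boto3's `put_bucket_lifecycle_configuration`):

```python
# Hypothetical rule matching the example: raw training logs under "raw-logs/"
# transition from S3 Standard to Glacier 30 days after creation.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-raw-training-logs",   # illustrative name
            "Filter": {"Prefix": "raw-logs/"},   # illustrative prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
```

Adding an `Expiration` action to the same rule would delete the objects outright instead of transitioning them.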
Worked Examples
Problem: Kinesis Data Stream Throttling
Scenario: A real-time ML pipeline is dropping data. CloudWatch shows ReadProvisionedThroughputExceeded.
- Analyze: Check the number of shards. Each shard supports 2 MB/sec of read throughput, shared across all standard (polling) consumers.
- Diagnose: You have 3 consumers (Lambda, Firehose, and a custom EC2 app) reading from 1 shard. $3 \times \text{Throughput} > 2\ \text{MB/sec}$.
- Solution: Enable Enhanced Fan-Out for the consumers or increase the shard count.
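The diagnosis above can be sketched numerically. Assuming each of the three consumers polls roughly 1 MB/sec (an illustrative figure), the shared 2 MB/sec egress budget per shard is exceeded:

```python
def shared_throughput_ok(consumers: int, read_mb_per_sec_each: float,
                         shards: int) -> bool:
    """Standard (polling) consumers share 2 MB/sec of egress per shard."""
    return consumers * read_mb_per_sec_each <= 2.0 * shards

# 3 consumers at ~1 MB/sec each against 1 shard: 3 MB/sec > 2 MB/sec.
print(shared_throughput_ok(3, 1.0, 1))  # False -> throttling
# Doubling the shard count brings the aggregate back under the limit:
print(shared_throughput_ok(3, 1.0, 2))  # True
```

Enhanced Fan-Out sidesteps this arithmetic entirely: each registered consumer gets its own dedicated 2 MB/sec pipe per shard, so consumers no longer compete for the shared budget.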
Problem: Slow AWS Glue ETL Jobs
Scenario: A Glue job processing 1TB of CSV data takes 5 hours.
- Analyze: The data is stored in one massive S3 folder.
- Optimize: Convert the data to Apache Parquet (columnar) and partition by date.
- Result: The Glue job now only reads the specific partitions needed, reducing time to 20 minutes.
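The partition benefit can be made concrete: with a year=/month=/day= layout (extending the table's year=/month= example by one level), a date-filtered job enumerates only the prefixes it needs instead of the whole bucket. A minimal sketch with an illustrative bucket name:

```python
from datetime import date, timedelta

def partition_prefixes(bucket: str, start: date, end: date) -> list:
    """Build the S3 prefixes a date-filtered job must read (inclusive range)."""
    prefixes, day = [], start
    while day <= end:
        prefixes.append(
            f"s3://{bucket}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        )
        day += timedelta(days=1)
    return prefixes

# A job filtered to 3 days touches exactly 3 prefixes, not the full 1 TB.
print(partition_prefixes("ml-raw-data", date(2024, 1, 1), date(2024, 1, 3)))
```

Glue exposes the same idea through partition indexes and pushdown predicates; the win comes from never listing or scanning the untouched date prefixes.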
Comparison Tables
Ingestion Service Comparison
| Service | Type | Best For | Troubleshooting Focus |
|---|---|---|---|
| Kinesis Streams | Streaming | Low-latency, custom processing | Shard count & Partition Keys |
| Data Firehose | Streaming | Loading into S3/Redshift | Buffer size & intervals |
| AWS Glue | Batch | ETL, Schema Discovery | Worker type (G.1X/G.2X) |
| DataSync | Batch | On-prem to AWS migration | Network bandwidth |
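The Firehose "buffer size & intervals" focus in the table can be reasoned about numerically: delivery fires when whichever buffering hint is reached first, so low-volume streams are interval-bound while high-volume streams are size-bound. A minimal sketch (function name and sample figures are illustrative):

```python
def firehose_flush_seconds(buffer_mb: float, interval_s: float,
                           ingest_mb_per_sec: float) -> float:
    """Approximate seconds until a Firehose buffer flushes: the size hint is
    hit after buffer_mb / ingest rate, unless the interval hint fires first."""
    seconds_to_fill = buffer_mb / ingest_mb_per_sec
    return min(interval_s, seconds_to_fill)

# Slow stream (0.01 MB/sec): a 5 MB buffer would take 500 s, so the
# 300 s interval wins -> interval-bound latency.
print(firehose_flush_seconds(5, 300, 0.01))  # 300
# Fast stream (1 MB/sec): the 5 MB buffer fills in 5 s -> size-bound.
print(firehose_flush_seconds(5, 300, 1.0))   # 5.0
```

When troubleshooting delivery lag, this is why shrinking the buffer interval only helps interval-bound (low-volume) streams.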
Checkpoint Questions
- What CloudWatch metric would you monitor to detect if a DynamoDB table is under-provisioned for writes?
- Why is a "Scan" operation in DynamoDB considered less efficient than a "Query"?
- An S3 bucket is experiencing high latency for GET requests. What naming strategy can help improve performance?
- What is the main difference between a Service Quota and a Performance Bottleneck?
Muddy Points & Cross-Refs
- S3 Throughput Limits: Many students forget that S3 supports 3,500 PUT and 5,500 GET requests per second per prefix. If you hit these, you need to add more prefixes (random strings) to your folder structure.
- Kinesis vs. MSK: Kinesis is fully managed and easier to scale; MSK (Kafka) offers more control but requires more manual configuration for capacity planning.
- EBS Volumes: Note the difference between gp3 (baseline performance with independently configurable IOPS) and Provisioned IOPS (io2) (guaranteed performance for high-load databases).
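The per-prefix request limits noted above are worked around by spreading keys across hashed prefixes, since each prefix gets its own request-rate budget. A minimal sketch (the fan-out of 16 and the key layout are illustrative):

```python
import hashlib

def spread_key(key: str, fanout: int = 16) -> str:
    """Prepend a short, deterministic hash prefix so objects distribute
    across `fanout` prefixes, each with its own 3,500 PUT / 5,500 GET
    per-second budget."""
    digest = hashlib.md5(key.encode()).hexdigest()
    prefix = int(digest[:2], 16) % fanout
    return f"{prefix:02x}/{key}"

# The same logical key always maps to the same prefix, so reads can
# recompute the location without a lookup table.
print(spread_key("raw-logs/2024/01/device_a.json"))
```

The trade-off is that simple prefix listings (e.g., "everything under raw-logs/") now require fanning out across all hash prefixes, which is why this pattern suits write-heavy ingestion more than ad-hoc browsing.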