Troubleshooting Data Ingestion and Storage: Capacity & Scalability
Troubleshooting and debugging data ingestion and storage issues that involve capacity and scalability
This guide covers the critical skills required to identify, debug, and resolve performance bottlenecks in the collection phase of the Machine Learning lifecycle, specifically within the AWS ecosystem.
Learning Objectives
By the end of this study guide, you will be able to:
- Identify common capacity bottlenecks in Kinesis, S3, and DynamoDB.
- Differentiate between real-time and batch ingestion troubleshooting strategies.
- Select appropriate monitoring tools (CloudWatch, CloudTrail) for specific failure modes.
- Optimize storage cost and performance using lifecycle policies and volume right-sizing.
Key Terms & Glossary
- Throughput: The amount of data moved from one place to another in a given time period.
- Sharding: A method of splitting a data stream (like Kinesis) into multiple segments to increase parallel processing capacity.
- Partitioning: Organizing data (e.g., in S3 or Glue) into hierarchical folders to optimize query performance and reduce scan volume.
- Service Quotas: Limits AWS imposes on resource usage per account, typically per Region (e.g., the number of running EC2 instances or S3 buckets).
- Backpressure: A phenomenon where a downstream system cannot keep up with the rate of incoming data, causing a bottleneck upstream.
The "Big Idea"
In the Machine Learning lifecycle, data ingestion and storage are the foundation. If these systems fail to scale or run out of capacity, downstream training and inference stop entirely. Scalability is the system's ability to handle growing amounts of work, while Capacity refers to the maximum amount that something can contain or produce. Troubleshooting involves finding where the "pipe" is too narrow for the "flow."
Formula / Concept Box
| Concept | Rule / Formula | Application |
|---|---|---|
| Kinesis Shard Capacity | 1 MB/sec (or 1,000 records/sec) ingress, 2 MB/sec egress per shard | Used to calculate the number of shards needed based on total throughput. |
| S3 Partitioning | s3://bucket/year=YYYY/month=MM/ | Drastically reduces data scanned by tools like Athena or Glue. |
| DynamoDB RCU/WCU | 1 WCU = one write of up to 1 KB/sec; 1 RCU = one strongly consistent read of up to 4 KB/sec | Calculating throughput costs for metadata indexing. |
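The shard and WCU rows translate directly into capacity arithmetic. A minimal sketch applying those formulas (function names and sample figures are illustrative):

```python
import math

def required_shards(ingress_mb_per_sec: float, egress_mb_per_sec: float) -> int:
    """Shards needed given 1 MB/sec ingress and 2 MB/sec egress per shard."""
    by_ingress = math.ceil(ingress_mb_per_sec / 1.0)
    by_egress = math.ceil(egress_mb_per_sec / 2.0)
    return max(by_ingress, by_egress, 1)

def required_wcus(writes_per_sec: int, item_size_kb: float) -> int:
    """WCUs needed: 1 WCU covers one write of up to 1 KB per second,
    so item size rounds up to the next whole KB."""
    return writes_per_sec * math.ceil(item_size_kb)

print(required_shards(4.5, 6.0))  # 5 (ingress-bound: 5 > 3)
print(required_wcus(100, 2.5))    # 300 (2.5 KB rounds up to 3 WCUs per write)
```

Note that the larger of the ingress- and egress-derived counts wins: a stream can be egress-bound even when writes fit comfortably in one shard.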
Hierarchical Outline
- I. Monitoring & Diagnostics
- CloudWatch Metrics: Monitoring CPU, Memory, I/O, and Throttle events.
- CloudTrail: Auditing API calls to find "Access Denied" or "Rate Exceeded" errors.
- II. Data Ingestion Troubleshooting
- Real-time (Kinesis/MSK): Resolving shard-level throttling and consumer lag.
- Batch (Glue/DataSync): Optimizing ETL job concurrency and partition logic.
- III. Storage Scalability
- Object (S3): Implementing lifecycle policies and VPC endpoints for throughput.
- Block (EBS): Right-sizing IOPS and choosing between gp3 and io2 volumes.
- NoSQL (DynamoDB): Avoiding "Hot Keys" and minimizing expensive Scan operations.
Visual Anchors
Troubleshooting Flowchart
Scalability Concept: Sharding vs. Capacity
\begin{tikzpicture}[node distance=1cm, auto]
  \draw[thick, ->] (0,0) -- (6,0) node[anchor=north] {Time/Scale};
  \draw[thick, ->] (0,0) -- (0,4) node[anchor=east] {Throughput};
  % Shard 1
  \draw[fill=blue!20] (0.5,0.5) rectangle (5,1.5);
  \node at (2.75, 1) {Shard 1 (Fixed Capacity)};
  % Shard 2 added
  \draw[fill=green!20] (0.5,1.6) rectangle (5,2.6);
  \node at (2.75, 2.1) {Shard 2 (Horizontal Scaling)};
  % Bottleneck line
  \draw[red, dashed, thick] (0, 3) -- (5.5, 3) node[right] {\small Bottleneck};
\end{tikzpicture}
Definition-Example Pairs
- Lifecycle Policy: Rules that automatically transition or delete data after a set period.
- Example: Moving raw training logs from S3 Standard to S3 Glacier after 30 days to save costs.
- Read Replicas: Copies of a database that handle read-only traffic.
- Example: Creating an Aurora Read Replica to handle a sudden surge in data exploration queries without slowing down the primary write engine.
- Hot Key: A partition key that receives a disproportionate amount of traffic.
- Example: In a Kinesis stream, if all sensor data for "Device_A" goes to one shard while other shards are empty, Shard 1 will throttle.
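The lifecycle-policy example above can be written out as the rule document S3 expects. A minimal sketch, with an illustrative rule ID and prefix (in practice you would pass this dictionary to boto3's `put_bucket_lifecycle_configuration`):

```python
# Hypothetical rule matching the example: raw training logs under "raw-logs/"
# transition from S3 Standard to Glacier 30 days after creation.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-raw-training-logs",   # illustrative name
            "Filter": {"Prefix": "raw-logs/"},   # illustrative prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
```

Adding an `Expiration` action to the same rule would delete the objects outright instead of transitioning them.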
Worked Examples
Problem: Kinesis Data Stream Throttling
Scenario: A real-time ML pipeline is dropping data. CloudWatch shows ReadProvisionedThroughputExceeded.
- Analyze: Check the number of shards. Each shard supports 2 MB/sec of read throughput, shared across all standard (polling) consumers.
- Diagnose: You have 3 consumers (Lambda, Firehose, and a custom EC2 app) reading from 1 shard. $3 \times \text{Throughput} > 2\ \text{MB/sec}$.
- Solution: Enable Enhanced Fan-Out for the consumers or increase the shard count.
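The diagnosis above can be sketched numerically. Assuming each of the three consumers polls roughly 1 MB/sec (an illustrative figure), the shared 2 MB/sec egress budget per shard is exceeded:

```python
def shared_throughput_ok(consumers: int, read_mb_per_sec_each: float,
                         shards: int) -> bool:
    """Standard (polling) consumers share 2 MB/sec of egress per shard."""
    return consumers * read_mb_per_sec_each <= 2.0 * shards

# 3 consumers at ~1 MB/sec each against 1 shard: 3 MB/sec > 2 MB/sec.
print(shared_throughput_ok(3, 1.0, 1))  # False -> throttling
# Doubling the shard count brings the aggregate back under the limit:
print(shared_throughput_ok(3, 1.0, 2))  # True
```

Enhanced Fan-Out sidesteps this arithmetic entirely: each registered consumer gets its own dedicated 2 MB/sec pipe per shard, so consumers no longer compete for the shared budget.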
Problem: Slow AWS Glue ETL Jobs
Scenario: A Glue job processing 1TB of CSV data takes 5 hours.
- Analyze: The data is stored in one massive S3 folder.
- Optimize: Convert the data to Apache Parquet (columnar) and partition by date.
- Result: The Glue job now only reads the specific partitions needed, reducing time to 20 minutes.
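The partition benefit can be made concrete: with a year=/month=/day= layout (extending the table's year=/month= example by one level), a date-filtered job enumerates only the prefixes it needs instead of the whole bucket. A minimal sketch with an illustrative bucket name:

```python
from datetime import date, timedelta

def partition_prefixes(bucket: str, start: date, end: date) -> list:
    """Build the S3 prefixes a date-filtered job must read (inclusive range)."""
    prefixes, day = [], start
    while day <= end:
        prefixes.append(
            f"s3://{bucket}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        )
        day += timedelta(days=1)
    return prefixes

# A job filtered to 3 days touches exactly 3 prefixes, not the full 1 TB.
print(partition_prefixes("ml-raw-data", date(2024, 1, 1), date(2024, 1, 3)))
```

Glue exposes the same idea through partition indexes and pushdown predicates; the win comes from never listing or scanning the untouched date prefixes.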
Comparison Tables
Ingestion Service Comparison
| Service | Type | Best For | Troubleshooting Focus |
|---|---|---|---|
| Kinesis Streams | Streaming | Low-latency, custom processing | Shard count & Partition Keys |
| Data Firehose | Streaming | Loading into S3/Redshift | Buffer size & intervals |
| AWS Glue | Batch | ETL, Schema Discovery | Worker type (G.1X/G.2X) |
| DataSync | Batch | On-prem to AWS migration | Network bandwidth |
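The Firehose "buffer size & intervals" focus in the table can be reasoned about numerically: delivery fires when whichever buffering hint is reached first, so low-volume streams are interval-bound while high-volume streams are size-bound. A minimal sketch (function name and sample figures are illustrative):

```python
def firehose_flush_seconds(buffer_mb: float, interval_s: float,
                           ingest_mb_per_sec: float) -> float:
    """Approximate seconds until a Firehose buffer flushes: the size hint is
    hit after buffer_mb / ingest rate, unless the interval hint fires first."""
    seconds_to_fill = buffer_mb / ingest_mb_per_sec
    return min(interval_s, seconds_to_fill)

# Slow stream (0.01 MB/sec): a 5 MB buffer would take 500 s, so the
# 300 s interval wins -> interval-bound latency.
print(firehose_flush_seconds(5, 300, 0.01))  # 300
# Fast stream (1 MB/sec): the 5 MB buffer fills in 5 s -> size-bound.
print(firehose_flush_seconds(5, 300, 1.0))   # 5.0
```

When troubleshooting delivery lag, this is why shrinking the buffer interval only helps interval-bound (low-volume) streams.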
Checkpoint Questions
- What CloudWatch metric would you monitor to detect if a DynamoDB table is under-provisioned for writes?
- Why is a "Scan" operation in DynamoDB considered less efficient than a "Query"?
- An S3 bucket is experiencing high latency for GET requests. What naming strategy can help improve performance?
- What is the main difference between a Service Quota and a Performance Bottleneck?
Muddy Points & Cross-Refs
- S3 Throughput Limits: Many students forget that S3 supports 3,500 PUT and 5,500 GET requests per second per prefix. If you hit these, you need to add more prefixes (random strings) to your folder structure.
- Kinesis vs. MSK: Kinesis is fully managed and easier to scale; MSK (Kafka) offers more control but requires more manual configuration for capacity planning.
- EBS Volumes: Note the difference between gp3 (baseline performance with independently configurable IOPS) and Provisioned IOPS (io2) (guaranteed performance for high-load databases).
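The per-prefix request limits noted above are worked around by spreading keys across hashed prefixes, since each prefix gets its own request-rate budget. A minimal sketch (the fan-out of 16 and the key layout are illustrative):

```python
import hashlib

def spread_key(key: str, fanout: int = 16) -> str:
    """Prepend a short, deterministic hash prefix so objects distribute
    across `fanout` prefixes, each with its own 3,500 PUT / 5,500 GET
    per-second budget."""
    digest = hashlib.md5(key.encode()).hexdigest()
    prefix = int(digest[:2], 16) % fanout
    return f"{prefix:02x}/{key}"

# The same logical key always maps to the same prefix, so reads can
# recompute the location without a lookup table.
print(spread_key("raw-logs/2024/01/device_a.json"))
```

The trade-off is that simple prefix listings (e.g., "everything under raw-logs/") now require fanning out across all hash prefixes, which is why this pattern suits write-heavy ingestion more than ad-hoc browsing.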