Data Optimization: Indexing, Partitioning, and Compression Strategies
Describe best practices for indexing, partitioning strategies, compression, and other data optimization techniques
This guide covers the essential techniques for optimizing data stores in AWS, focusing on Amazon S3, Redshift, and DynamoDB. Mastering these strategies is critical for the AWS Certified Data Engineer – Associate exam, specifically for improving query performance and reducing storage costs.
Learning Objectives
By the end of this guide, you should be able to:
- Distinguish between partitioning and bucketing in an S3 Data Lake.
- Select appropriate compression algorithms based on performance and storage requirements.
- Implement sort keys and distribution styles in Amazon Redshift.
- Apply file size optimization and columnar formats to improve query speed in Athena.
- Identify how predicate pushdown reduces I/O overhead.
Key Terms & Glossary
- Cardinality: The uniqueness of data values in a column. High cardinality means many unique values (e.g., User ID); low cardinality means few unique values (e.g., Country).
- Splittability: The ability of a file to be divided and processed in parallel by different compute nodes. Columnar formats like Parquet are natively splittable.
- Predicate Pushdown: An optimization where filtering (the `WHERE` clause) is moved as close to the data source as possible to avoid reading unnecessary data.
- Data Skew: An uneven distribution of data across partitions or nodes, leading to "hot spots" where one resource does more work than others.
- GSI (Global Secondary Index): An index in DynamoDB with a partition key and a sort key that can be different from those on the base table.
The "Big Idea"
> [!IMPORTANT]
> The primary goal of data optimization in a cloud environment is Data Pruning. Whether through partitioning, indexing, or columnar storage, every technique aims to minimize the amount of data scanned. Since services like Amazon Athena and Redshift Spectrum charge based on data scanned, optimization directly translates to both speed and cost savings.
Formula / Concept Box
| Optimization Factor | Target Metric / Best Practice |
|---|---|
| Optimal File Size (S3) | ~128 MB (Balance between metadata overhead and parallelism) |
| Partition Key Selection | Work backward from query filters (use low cardinality fields) |
| Bucketing Candidate | High cardinality columns (IDs) frequently used in joins |
| Redshift Encoding | Use ENCODE AUTO unless manually optimizing sort keys |
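The ~128 MB file-size target above can be turned into a simple sizing calculation when configuring an ETL job's output file count. A minimal sketch; the dataset size and compression ratio used here are illustrative assumptions, not figures from this guide:

```python
TARGET_FILE_MB = 128  # sweet spot between metadata overhead and parallelism

def target_file_count(raw_gb: float, compression_ratio: float = 0.4) -> int:
    """Return a file count that yields roughly 128 MB files after compression."""
    compressed_mb = raw_gb * 1024 * compression_ratio
    return max(1, round(compressed_mb / TARGET_FILE_MB))

# e.g. 100 GB of raw CSV compressed to ~40% of its size -> 320 output files
print(target_file_count(100))
```

The same arithmetic works in reverse to decide when small files should be compacted.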
Hierarchical Outline
- S3 Data Lake Optimization
  - Partitioning: Physical separation by folder (e.g., `/year=2023/month=10/`).
  - Bucketing: Hash-based distribution within partitions.
  - File Formats: Moving from row-based (CSV/JSON) to columnar (Parquet/ORC).
- Compression Techniques
- High Speed: Snappy (best for intermediate Spark/Glue jobs).
- Balanced: zstd (Zstandard) for general-purpose storage.
- High Ratio: GZIP (slowest, but smallest footprint).
- Amazon Redshift Physical Modeling
- Compound Sort Keys: Best for consistent leading-column filters.
- Interleaved Sort Keys: Best for flexible filtering across multiple columns.
- Distribution Styles: AUTO, EVEN, KEY, ALL.
- Query Engine Optimizations
- Predicate Pushdown: Filtering at the storage layer.
- Vectorization: Using SIMD to process multiple data points in one CPU cycle.
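The bucketing item in the outline above can be sketched in a few lines: a hash of the join key decides which of a fixed number of files a row lands in. MD5 is used here only as a stand-in (Hive and Spark use their own hash functions), and the key name and bucket count are illustrative:

```python
import hashlib

NUM_BUCKETS = 16  # fixed when the table is defined; illustrative value

def bucket_for(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a high-cardinality join key to one of a fixed number of buckets."""
    # Rows with the same key always hash to the same bucket file, so a join
    # on user_id can proceed bucket-by-bucket without a full shuffle.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(bucket_for("u-1001") == bucket_for("u-1001"))  # same key, same bucket
```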
Visual Anchors
Data Pruning Flow
Storage Architecture Comparison
```latex
\begin{tikzpicture}[node distance=2cm]
  \draw[thick] (0,0) rectangle (2,3) node[midway]
    {\begin{tabular}{c} Row-based \\ (CSV) \\ \hline R1: A,B,C \\ R2: A,B,C \end{tabular}};
  \draw[thick, fill=blue!10] (4,0) rectangle (6,3) node[midway]
    {\begin{tabular}{c} Columnar \\ (Parquet) \\ \hline A: 1,2,3 \\ B: X,Y,Z \end{tabular}};
  \draw[->, thick] (2.2,1.5) -- (3.8,1.5) node[midway, above] {Optimize};
  \node at (1,-0.5) {Linear Scan};
  \node at (5,-0.5) {Selective Read};
\end{tikzpicture}
```
Definition-Example Pairs
- Compound Sort Key: A sort key composed of multiple columns where order matters.
  - Example: Sorting by `(region, date)`. Queries filtering by `region` or `region AND date` are fast, but queries filtering only by `date` gain little benefit.
- Partition Projection: A feature in AWS Glue/Athena that calculates partition metadata from a pattern rather than querying the Glue Catalog.
  - Example: In a dataset partitioned by `day`, instead of scanning the catalog for thousands of folders, Athena calculates the path based on the date range provided.
- Snappy Compression: A high-speed, moderate-ratio compression algorithm. Raw Snappy streams are not splittable, but Snappy inside Parquet/ORC is, because compression is applied per block.
  - Example: Used as the default compression for Parquet files in Spark to ensure fast writes during ETL.
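The partition-projection idea can be illustrated by computing day-partition prefixes directly from a date range instead of enumerating them from a catalog. The bucket name and key layout below are hypothetical:

```python
from datetime import date, timedelta

def partition_paths(start: date, end: date,
                    prefix: str = "s3://my-bucket/logs") -> list:
    """Derive year=/month=/day= prefixes arithmetically from a date range,
    the way partition projection avoids scanning the Glue Catalog."""
    days = (end - start).days + 1
    return [f"{prefix}/year={d:%Y}/month={d:%m}/day={d:%d}/"
            for d in (start + timedelta(n) for n in range(days))]

print(partition_paths(date(2023, 10, 1), date(2023, 10, 3)))
```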
Worked Examples
Scenario 1: Optimizing an Athena Query
Problem: A query against a 5 TB S3 bucket containing CSV logs takes 10 minutes and costs $25 per run.

Solution Steps:
- Convert to Parquet: Use AWS Glue to convert the CSV files to Parquet, enabling columnar reads.
- Partition by Date: Organize S3 keys as `s3://bucket/logs/year=YYYY/month=MM/day=DD/`.
- Compress with Snappy: Reduce the file footprint by ~60%.

Outcome: The query now scans only the specific date folder and only the columns requested, reducing runtime to 20 seconds and cost to under $1.
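The cost arithmetic in this scenario follows from Athena's pricing model of $5 per TB scanned; the post-optimization scan fraction below is an illustrative assumption, not a measured figure:

```python
ATHENA_USD_PER_TB = 5.0  # Athena's published rate: $5 per TB scanned

def scan_cost(tb_scanned: float) -> float:
    """Cost of an Athena query as a function of data scanned."""
    return tb_scanned * ATHENA_USD_PER_TB

before = scan_cost(5.0)        # full 5 TB CSV scan -> $25.00
# One day's partition plus column pruning; the 2% fraction is an assumption.
after = scan_cost(5.0 * 0.02)  # well under $1
print(before, after)
```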
Scenario 2: Redshift Join Optimization
Problem: Two large tables, Sales and Customers, are frequently joined on `customer_id`, but the query performs a slow Hash Join.

Solution:
- Set `customer_id` as the Distribution Key (DISTSTYLE KEY) for both tables.
- Set `customer_id` as the first column in the Sort Key for both tables.

Outcome: Redshift can now perform a Sort Merge Join, which is significantly faster because matching rows are already co-located on the same nodes and ordered.
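Why co-location removes data movement can be shown with a toy model. The modulo below is only a stand-in for Redshift's internal hash function, and the slice count is illustrative:

```python
NUM_SLICES = 4  # illustrative; a real cluster has several slices per node

def slice_for(customer_id: int, slices: int = NUM_SLICES) -> int:
    # DISTSTYLE KEY routes each row by hashing the distribution key;
    # modulo stands in for Redshift's actual hash here.
    return customer_id % slices

sales = [(101, 49.99), (202, 15.00), (101, 7.25)]
customers = {101: "Ana", 202: "Bo"}

# Because both tables use the same distribution key, a sales row and its
# matching customer row always land on the same slice -> no shuffle to join.
print({cid: slice_for(cid) for cid in customers})
```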
Comparison Tables
| Feature | Partitioning | Bucketing |
|---|---|---|
| Key Characteristic | Low cardinality (dates, regions) | High cardinality (User IDs, Device IDs) |
| Physical Structure | Creates separate folders (prefixes) | Creates a fixed number of files in a folder |
| Primary Benefit | Eliminates folders from scan | Optimizes joins and point lookups |
| Common Mistake | Partitioning by UserID (too many folders) | Creating too few buckets for data size |

| Format | Row/Column | Splittable | Best Use Case |
|---|---|---|---|
| CSV | Row | No (if compressed) | Raw data ingestion |
| Parquet | Columnar | Yes | Analytical queries (OLAP) |
| ORC | Columnar | Yes | Highly optimized for Hive/Hadoop |
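The speed-versus-ratio trade-off behind these format and compression choices is easy to measure. Python's standard library has no Snappy or zstd bindings, so gzip and bz2 stand in here to show the general pattern (slower algorithm, smaller output); the sample data is synthetic:

```python
import bz2
import gzip
import time

# Synthetic CSV-like log data; repetitive, so it compresses well.
sample = b"2023-10-01T00:00:00Z,eu-west-1,u-1001,click\n" * 20_000

for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    out = compress(sample)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {len(out) / len(sample):.1%} of original, {elapsed_ms:.0f} ms")
```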
Checkpoint Questions
- Why is 128MB considered the "optimum" file size for S3 data lakes?
- In Amazon Redshift, why should you avoid compressing sort key columns?
- Which compression algorithm provides the best balance between performance and ratio for long-term storage?
- What happens to query performance if your partition key has high cardinality (e.g., millions of unique values)?
Muddy Points & Cross-Refs
- Partitioning vs. Indexing: In NoSQL (DynamoDB), you define Indexes (GSIs). In S3, you use Partitioning. While they both "locate data," Partitioning is a physical storage layout change, whereas Indexing is a secondary data structure.
- Splittability: Remember that while GZIP has a high compression ratio, it is not splittable. If you have one massive 10GB GZIP file, only one core can process it at a time. Use Parquet with Snappy or BZip2 if you need parallelism.
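A quick way to see why one large GZIP file defeats parallelism: a worker cannot start decompressing from an arbitrary byte offset, because a gzip stream is only decodable from its header onward. A minimal illustration:

```python
import gzip

blob = gzip.compress(b"log line\n" * 100_000)

# Reading from the start works fine...
assert gzip.decompress(blob).startswith(b"log line")

# ...but a worker handed only the second half of the file cannot decode it:
# there is no gzip header (or usable stream state) at that offset.
try:
    gzip.decompress(blob[len(blob) // 2:])
    mid_file_read = True
except Exception:
    mid_file_read = False
print(mid_file_read)  # False
```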
- Further Study: Cross-reference this with AWS Glue Data Catalog and Athena Partition Projection documentation for implementation details.