Data Optimization: Indexing, Partitioning, and Compression Strategies
Describe best practices for indexing, partitioning strategies, compression, and other data optimization techniques
This guide covers the essential techniques for optimizing data stores in AWS, focusing on Amazon S3, Redshift, and DynamoDB. Mastering these strategies is critical for the AWS Certified Data Engineer – Associate exam, specifically for improving query performance and reducing storage costs.
Learning Objectives
By the end of this guide, you should be able to:
- Distinguish between partitioning and bucketing in an S3 Data Lake.
- Select appropriate compression algorithms based on performance and storage requirements.
- Implement sort keys and distribution styles in Amazon Redshift.
- Apply file size optimization and columnar formats to improve query speed in Athena.
- Identify how predicate pushdown reduces I/O overhead.
Key Terms & Glossary
- Cardinality: The uniqueness of data values in a column. High cardinality means many unique values (e.g., User ID); low cardinality means few unique values (e.g., Country).
- Splittability: The ability of a file to be divided and processed in parallel by different compute nodes. Columnar formats like Parquet are natively splittable.
- Predicate Pushdown: An optimization where filtering (the `WHERE` clause) is moved as close to the data source as possible to avoid reading unnecessary data.
- Data Skew: An uneven distribution of data across partitions or nodes, leading to "hot spots" where one resource does more work than others.
- GSI (Global Secondary Index): An index in DynamoDB with a partition key and a sort key that can be different from those on the base table.
The "Big Idea"
> [!IMPORTANT]
> The primary goal of data optimization in a cloud environment is Data Pruning. Whether through partitioning, indexing, or columnar storage, every technique aims to minimize the amount of data scanned. Since services like Amazon Athena and Redshift Spectrum charge based on data scanned, optimization directly translates to both speed and cost savings.
Formula / Concept Box
| Optimization Factor | Target Metric / Best Practice |
|---|---|
| Optimal File Size (S3) | ~128 MB (Balance between metadata overhead and parallelism) |
| Partition Key Selection | Work backward from query filters (use low cardinality fields) |
| Bucketing Candidate | High cardinality columns (IDs) frequently used in joins |
| Redshift Encoding | Use ENCODE AUTO unless manually optimizing sort keys |
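The ~128 MB file-size target above can be turned into a simple sizing calculation when configuring an ETL job's output file count. A minimal sketch; the dataset size and compression ratio used here are illustrative assumptions, not figures from this guide:

```python
TARGET_FILE_MB = 128  # sweet spot between metadata overhead and parallelism

def target_file_count(raw_gb: float, compression_ratio: float = 0.4) -> int:
    """Return a file count that yields roughly 128 MB files after compression."""
    compressed_mb = raw_gb * 1024 * compression_ratio
    return max(1, round(compressed_mb / TARGET_FILE_MB))

# e.g. 100 GB of raw CSV compressed to ~40% of its size -> 320 output files
print(target_file_count(100))
```

The same arithmetic works in reverse to decide when small files should be compacted.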
Hierarchical Outline
- S3 Data Lake Optimization
  - Partitioning: Physical separation by folder (e.g., `/year=2023/month=10/`).
  - Bucketing: Hash-based distribution within partitions.
  - File Formats: Moving from row-based (CSV/JSON) to columnar (Parquet/ORC).
- Compression Techniques
- High Speed: Snappy (best for intermediate Spark/Glue jobs).
- Balanced: zstd (Zstandard) for general-purpose storage.
- High Ratio: GZIP (slowest, but smallest footprint).
- Amazon Redshift Physical Modeling
- Compound Sort Keys: Best for consistent leading-column filters.
- Interleaved Sort Keys: Best for flexible filtering across multiple columns.
- Distribution Styles: AUTO, EVEN, KEY, ALL.
- Query Engine Optimizations
- Predicate Pushdown: Filtering at the storage layer.
- Vectorization: Using SIMD to process multiple data points in one CPU cycle.
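The bucketing item in the outline above can be sketched in a few lines: a hash of the join key decides which of a fixed number of files a row lands in. MD5 is used here only as a stand-in (Hive and Spark use their own hash functions), and the key name and bucket count are illustrative:

```python
import hashlib

NUM_BUCKETS = 16  # fixed when the table is defined; illustrative value

def bucket_for(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a high-cardinality join key to one of a fixed number of buckets."""
    # Rows with the same key always hash to the same bucket file, so a join
    # on user_id can proceed bucket-by-bucket without a full shuffle.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(bucket_for("u-1001") == bucket_for("u-1001"))  # same key, same bucket
```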
Visual Anchors
Data Pruning Flow
Storage Architecture Comparison
```latex
\begin{tikzpicture}[node distance=2cm]
  \draw[thick] (0,0) rectangle (2,3) node[midway]
    {\begin{tabular}{c} Row-based \\ (CSV) \\ \hline R1: A,B,C \\ R2: A,B,C \end{tabular}};
  \draw[thick, fill=blue!10] (4,0) rectangle (6,3) node[midway]
    {\begin{tabular}{c} Columnar \\ (Parquet) \\ \hline A: 1,2,3 \\ B: X,Y,Z \end{tabular}};
  \draw[->, thick] (2.2,1.5) -- (3.8,1.5) node[midway, above] {Optimize};
  \node at (1,-0.5) {Linear Scan};
  \node at (5,-0.5) {Selective Read};
\end{tikzpicture}
```
Definition-Example Pairs
- Compound Sort Key: A sort key composed of multiple columns where order matters.
  - Example: Sorting by `(region, date)`. Queries filtering by `region` or `region AND date` are fast, but queries filtering only by `date` gain little benefit.
- Partition Projection: A feature in AWS Glue/Athena that calculates partition metadata from a pattern rather than querying the Glue Catalog.
  - Example: In a dataset partitioned by `day`, instead of scanning the catalog for thousands of folders, Athena calculates the path based on the date range provided.
- Snappy Compression: A high-speed, moderate-ratio compression algorithm. Raw Snappy streams are not splittable, but Snappy inside Parquet/ORC is, because compression is applied per block.
  - Example: Used as the default compression for Parquet files in Spark to ensure fast writes during ETL.
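The partition-projection idea can be illustrated by computing day-partition prefixes directly from a date range instead of enumerating them from a catalog. The bucket name and key layout below are hypothetical:

```python
from datetime import date, timedelta

def partition_paths(start: date, end: date,
                    prefix: str = "s3://my-bucket/logs") -> list:
    """Derive year=/month=/day= prefixes arithmetically from a date range,
    the way partition projection avoids scanning the Glue Catalog."""
    days = (end - start).days + 1
    return [f"{prefix}/year={d:%Y}/month={d:%m}/day={d:%d}/"
            for d in (start + timedelta(n) for n in range(days))]

print(partition_paths(date(2023, 10, 1), date(2023, 10, 3)))
```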
Worked Examples
Scenario 1: Optimizing an Athena Query
Problem: A query against a 5 TB S3 bucket containing CSV logs takes 10 minutes and costs $25 per run.

Solution Steps:
- Convert to Parquet: Use AWS Glue to convert the CSV files to Parquet, enabling columnar reads.
- Partition by Date: Organize S3 keys as `s3://bucket/logs/year=YYYY/month=MM/day=DD/`.
- Compress with Snappy: Reduce the file footprint by ~60%.

Outcome: The query now scans only the specific date folder and only the columns requested, reducing runtime to 20 seconds and cost to under $1.
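The cost arithmetic in this scenario follows from Athena's pricing model of $5 per TB scanned; the post-optimization scan fraction below is an illustrative assumption, not a measured figure:

```python
ATHENA_USD_PER_TB = 5.0  # Athena's published rate: $5 per TB scanned

def scan_cost(tb_scanned: float) -> float:
    """Cost of an Athena query as a function of data scanned."""
    return tb_scanned * ATHENA_USD_PER_TB

before = scan_cost(5.0)        # full 5 TB CSV scan -> $25.00
# One day's partition plus column pruning; the 2% fraction is an assumption.
after = scan_cost(5.0 * 0.02)  # well under $1
print(before, after)
```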
Scenario 2: Redshift Join Optimization
Problem: Two large tables, Sales and Customers, are frequently joined on `customer_id`, but the query performs a slow Hash Join.

Solution:
- Set `customer_id` as the Distribution Key (DISTSTYLE KEY) for both tables.
- Set `customer_id` as the first column in the Sort Key for both tables.

Outcome: Redshift can now perform a Sort Merge Join, which is significantly faster because matching rows are already co-located on the same nodes and ordered.
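Why co-location removes data movement can be shown with a toy model. The modulo below is only a stand-in for Redshift's internal hash function, and the slice count is illustrative:

```python
NUM_SLICES = 4  # illustrative; a real cluster has several slices per node

def slice_for(customer_id: int, slices: int = NUM_SLICES) -> int:
    # DISTSTYLE KEY routes each row by hashing the distribution key;
    # modulo stands in for Redshift's actual hash here.
    return customer_id % slices

sales = [(101, 49.99), (202, 15.00), (101, 7.25)]
customers = {101: "Ana", 202: "Bo"}

# Because both tables use the same distribution key, a sales row and its
# matching customer row always land on the same slice -> no shuffle to join.
print({cid: slice_for(cid) for cid in customers})
```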
Comparison Tables
| Feature | Partitioning | Bucketing |
|---|---|---|
| Key Characteristic | Low cardinality (dates, regions) | High cardinality (User IDs, Device IDs) |
| Physical Structure | Creates separate folders (prefixes) | Creates a fixed number of files in a folder |
| Primary Benefit | Eliminates folders from scan | Optimizes joins and point lookups |
| Common Mistake | Partitioning by UserID (too many folders) | Creating too few buckets for data size |

| Format | Row/Column | Splittable | Best Use Case |
|---|---|---|---|
| CSV | Row | No (if compressed) | Raw data ingestion |
| Parquet | Columnar | Yes | Analytical queries (OLAP) |
| ORC | Columnar | Yes | Highly optimized for Hive/Hadoop |
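The speed-versus-ratio trade-off behind these format and compression choices is easy to measure. Python's standard library has no Snappy or zstd bindings, so gzip and bz2 stand in here to show the general pattern (slower algorithm, smaller output); the sample data is synthetic:

```python
import bz2
import gzip
import time

# Synthetic CSV-like log data; repetitive, so it compresses well.
sample = b"2023-10-01T00:00:00Z,eu-west-1,u-1001,click\n" * 20_000

for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    out = compress(sample)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {len(out) / len(sample):.1%} of original, {elapsed_ms:.0f} ms")
```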
Checkpoint Questions
- Why is 128MB considered the "optimum" file size for S3 data lakes?
- In Amazon Redshift, why should you avoid compressing sort key columns?
- Which compression algorithm provides the best balance between performance and ratio for long-term storage?
- What happens to query performance if your partition key has high cardinality (e.g., millions of unique values)?
Muddy Points & Cross-Refs
- Partitioning vs. Indexing: In NoSQL (DynamoDB), you define Indexes (GSIs). In S3, you use Partitioning. While they both "locate data," Partitioning is a physical storage layout change, whereas Indexing is a secondary data structure.
- Splittability: Remember that while GZIP has a high compression ratio, it is not splittable. If you have one massive 10GB GZIP file, only one core can process it at a time. Use Parquet with Snappy or BZip2 if you need parallelism.
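A quick way to see why one large GZIP file defeats parallelism: a worker cannot start decompressing from an arbitrary byte offset, because a gzip stream is only decodable from its header onward. A minimal illustration:

```python
import gzip

blob = gzip.compress(b"log line\n" * 100_000)

# Reading from the start works fine...
assert gzip.decompress(blob).startswith(b"log line")

# ...but a worker handed only the second half of the file cannot decode it:
# there is no gzip header (or usable stream state) at that offset.
try:
    gzip.decompress(blob[len(blob) // 2:])
    mid_file_read = True
except Exception:
    mid_file_read = False
print(mid_file_read)  # False
```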
- Further Study: Cross-reference this with AWS Glue Data Catalog and Athena Partition Projection documentation for implementation details.