DynamoDB Data Retrieval: Query vs. Scan
Describe differences between query and scan operations
DynamoDB Data Retrieval: Query vs. Scan
Understanding the choice between a Query and a Scan operation is fundamental to building scalable, cost-effective applications on Amazon DynamoDB. This guide breaks down the mechanics, performance implications, and best practices for each.
Learning Objectives
After studying this guide, you should be able to:
- Explain the fundamental mechanical difference between a Scan and a Query.
- Evaluate the impact of each operation on Read Capacity Units (RCUs) and cost.
- Identify when to use
FilterExpressionandProjectionExpressioneffectively. - Determine the appropriate retrieval method based on specific application access patterns.
- Understand the constraints and optimization techniques (e.g., Parallel Scan).
Key Terms & Glossary
- Scan: An operation that examines every item in a table or secondary index.
- Query: An operation that finds items based on a Partition Key and an optional Sort Key.
- RCU (Read Capacity Unit): The unit of cost for reading data; Query operations typically consume significantly fewer than Scans.
- ProjectionExpression: A string that identifies the specific attributes you want to retrieve, reducing the size of the returned data payload.
- FilterExpression: A string used to discard items after they have been read but before they are returned to the application.
- Sequential Scan: The default method where DynamoDB processes data in a single thread.
- Parallel Scan: A method where a table is divided into segments and scanned concurrently by multiple threads or processes.
The "Big Idea"
In DynamoDB, Query is a precision instrument, while Scan is a sledgehammer. If you know the "address" (Primary Key) of your data, you should always use a Query. A Scan is essentially a full-table read that becomes increasingly slow and expensive as your data grows. Choosing the wrong one can lead to performance bottlenecks and unexpected AWS bills.
Formula / Concept Box
| Feature | Scan Operation | Query Operation |
|---|---|---|
| Search Criteria | Any attribute (unstructured) | Primary Key (Partition Key required) |
| Data Accessed | Entire Table/Index | Targeted Items within a Partition |
| Efficiency | Low (Full read) | High (Direct access) |
| RCU Consumption | High (Based on total data read) | Low (Based on items matching PK) |
| Best Use Case | Reporting, data migration, small tables | Real-time apps, specific record lookups |
| Size Limit | 1 MB per request | 1 MB per request |
Hierarchical Outline
- The Scan Operation
- Mechanics: Accesses every item in the table.
- Filtering:
FilterExpressiondoes not save RCUs; it only trims the result set sent to the client. - Efficiency: Consumes RCUs for every item in the table, regardless of filters.
- Parallelism: Use
SegmentandTotalSegmentsto speed up large scans.
- The Query Operation
- Mechanics: Requires a specific Partition Key value.
- Refinement: Can use a Sort Key with comparison operators (e.g.,
begins_with,between). - Efficiency: Only consumes RCUs for the specific items retrieved within the partition.
- Optimization Strategies
- Projection: Use
ProjectionExpressionfor both to minimize data transfer size. - Consistency: Both default to Eventually Consistent reads; set
ConsistentReadtotruefor Strong Consistency.
- Projection: Use
Visual Anchors
Retrieval Decision Logic
Scanning vs. Querying the Data Blocks
\begin{tikzpicture}[scale=0.8] % Scan side \draw[thick] (0,4) node[above] {\textbf{Scan (Full Table)}}; \foreach \y in {0,0.8,1.6,2.4,3.2} \draw[fill=red!20] (0,\y) rectangle (3,\y+0.6); \draw[->, thick, red] (-0.5,3.8) -- (-0.5,0); \node[red, rotate=90] at (-0.8,1.9) {Reads Every Block};
% Query side
\draw[thick] (6,4) node[above] {\textbf{Query (Partitioned)}};
\foreach \y in {0,0.8,1.6,2.4,3.2}
\draw (6,\y) rectangle (9,\y+0.6);
\draw[fill=green!20] (6,1.6) rectangle (9,2.2);
\draw[->, thick, green!60!black] (10,1.9) -- (9.2,1.9);
\node[green!60!black] at (11.5,1.9) {Targeted PK};\end{tikzpicture}
Definition-Example Pairs
- Operation: Query
- Definition: Finding items sharing a partition key, optionally filtered by a sort key.
- Example: Looking up all "Orders" for "CustomerID: 12345".
- Operation: Scan
- Definition: A linear search through the entire dataset.
- Example: Finding all products in a warehouse that have a "status" of "damaged" when "status" is not a key.
- Parameter: FilterExpression
- Definition: A post-read filter that reduces the payload but not the cost.
- Example: Scanning a table for users and using a filter to only return those with
age > 21. (You still pay to read the users under 21).
Worked Examples
Example 1: Efficiency Comparison
Scenario: A table has 10,000 items. Each item is 1 KB. You want to find 10 items belonging to a specific user.
- Method A (Scan): DynamoDB reads all 10,000 items (10 MB). It then filters out 9,990 items and returns 10. You are charged RCUs for reading 10 MB of data.
- Method B (Query): You provide the UserID (Partition Key). DynamoDB goes directly to the partition and reads only the 10 items (10 KB). You are charged RCUs for 10 KB.
Example 2: The Parallel Scan
Scenario: You must perform a maintenance scan on a 50 GB table to update a specific attribute.
- Sequential Scan: One thread reads the 50 GB block by block. It takes 2 hours.
- Parallel Scan: You set
TotalSegmentsto 4. You run 4 worker processes, each reading a 12.5 GB segment simultaneously. The task completes in approximately 30 minutes (assuming sufficient RCU capacity).
Checkpoint Questions
- Does using a
FilterExpressionin a Scan operation reduce the number of RCUs consumed? (Answer: No, it only reduces the network payload). - What is the mandatory attribute required to perform a
Query? (Answer: The Partition Key). - What is the maximum amount of data a single Scan or Query request can return? (Answer: 1 MB).
- When should you use a
Parallel Scaninstead of a standard Scan? (Answer: When the table is large and you have spare RCU capacity to increase throughput). - How can you restrict a Query to only return specific attributes (like
EmailandName)? (Answer: By using aProjectionExpression).