AWS Data Engineer Associate: Vector Indexing (HNSW & IVF)
Describe vector index types (for example, HNSW, IVF)
Vector Indexing: HNSW and IVF for High-Dimensional Search
In the era of Generative AI and Large Language Models (LLMs), searching through massive datasets of unstructured data (text, images, audio) requires a new approach: Vector Search. This guide explores the two primary index types used in AWS services such as Amazon Aurora (via pgvector) and Amazon OpenSearch Service: HNSW and IVF.
Learning Objectives
By the end of this guide, you should be able to:
- Explain the concept of Approximate Nearest Neighbor (ANN) search.
- Describe the architecture and performance trade-offs of Hierarchical Navigable Small Worlds (HNSW).
- Explain how Inverted File Indexes (IVF) partition vector space to optimize search.
- Compare HNSW and IVF based on memory usage, query latency, and build time.
- Identify AWS services that support these indexing algorithms (e.g., Aurora PostgreSQL with pgvector).
Key Terms & Glossary
- Embedding: A numerical representation of data (like a word or image) in a high-dimensional vector space. Example: A 1536-dimensional vector representing the meaning of a sentence.
- Approximate Nearest Neighbor (ANN): An algorithm category that finds "close enough" vectors quickly rather than calculating exact distances for every item in a database.
- Centroid: The geometric center of a cluster of vectors in an IVF index.
- Recall: The percentage of the true nearest neighbors found by an ANN search. High recall means high accuracy.
- Dimensionality: The number of coordinates in a vector (e.g., 768 or 1536).
The "Big Idea"
Traditional databases use B-Trees to find exact matches or ranges of numbers. However, you cannot "sort" a vector. To find similar items, we must measure the distance between points in high-dimensional space. Since calculating distance against millions of vectors is too slow, we use Vector Indexes (HNSW/IVF) to create "shortcuts" that allow us to find similar items in milliseconds instead of seconds.
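To make the cost concrete, here is what an *exact* nearest-neighbor query looks like in pgvector without any index (a sketch, assuming the `vector` extension and a `product_embeddings` table like the one created in the worked example below). The literal is truncated for readability; a real query must match the column's dimension:

```sql
-- Exact (brute-force) k-NN: computes the distance for EVERY row.
-- Without a vector index this is a full sequential scan -- fine for
-- thousands of rows, far too slow for millions.
SELECT id
FROM product_embeddings
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'  -- <-> is pgvector's Euclidean distance operator
LIMIT 5;                                  -- (illustrative 3-dim literal; must match the column's dimension)
```

HNSW and IVF exist precisely so this `ORDER BY distance` can be answered approximately without touching every row.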
Formula / Concept Box
| Metric | Description | Formula / Logic |
|---|---|---|
| Euclidean Distance ($L_2$) | Straight-line distance between two points. | $d = \sqrt{\sum_i (q_i - v_i)^2}$ |
| Cosine Similarity | Measures the angle between two vectors (ignores magnitude). | $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert\, \lVert B \rVert}$ |
| Inner Product (IP) | Measures both angle and magnitude. | $A \cdot B = \sum_i A_i B_i$ |
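These three metrics map directly onto pgvector's distance operators (a minimal sketch, assuming the `vector` extension is installed). Note that `<=>` returns cosine *distance* (i.e., $1 - \cos(\theta)$) and `<#>` returns the *negative* inner product, so that for all three operators a smaller value always means "closer":

```sql
-- pgvector distance operators for the three metrics above
SELECT
  '[1,2,3]'::vector <-> '[4,5,6]'::vector AS euclidean_distance,  -- L2 distance
  '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance,     -- 1 - cos(theta)
  '[1,2,3]'::vector <#> '[4,5,6]'::vector AS neg_inner_product;   -- -(A . B)
```

The operator you use in `ORDER BY` must match the operator class the index was built with (e.g., `vector_cosine_ops` pairs with `<=>`).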
Hierarchical Outline
- Vector Fundamentals
- Embeddings translate context into numbers.
- Vector databases store these numbers for semantic search.
- HNSW (Hierarchical Navigable Small Worlds)
- Graph-based index.
- Multi-layered approach for navigation.
- High performance but high memory usage.
- IVF (Inverted File Index)
- Partition-based index.
- Groups vectors into clusters (Voronoi cells).
- Uses a "centroid" to skip irrelevant clusters.
- AWS Implementation
- Amazon Aurora (pgvector): Supports both HNSW and IVF.
- Amazon OpenSearch Service: Native k-NN support for HNSW and IVF via its vector engines.
- Amazon Bedrock: Uses vector indexes in Knowledge Bases.
Visual Anchors
- HNSW Layered Architecture (diagram)
- IVF Voronoi Partitioning (diagram)
Definition-Example Pairs
- HNSW (Hierarchical Navigable Small Worlds): A graph-based index that connects points to their nearest neighbors across multiple layers.
- Example: Finding a specific house in a new city by first looking at a state map (Top Layer), then a city map (Middle), then a street map (Bottom).
- IVF (Inverted File Index): A technique that divides the vector space into buckets (Voronoi cells) and only searches the most relevant buckets.
- Example: To find a specific book, you go only to the "Science Fiction" section instead of checking every book in the library.
Worked Examples
Example 1: Implementing HNSW in Amazon Aurora PostgreSQL
To use vector search in Aurora, you must use the pgvector extension.
Step 1: Enable the extension
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

Step 2: Create a table with a vector column

```sql
CREATE TABLE product_embeddings (
    id bigserial PRIMARY KEY,
    embedding vector(1536) -- 1536 is the dimension for OpenAI/Bedrock models
);
```

Step 3: Create an HNSW index

```sql
CREATE INDEX ON product_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

> [!NOTE]
> `m` is the number of connections per node, and `ef_construction` determines the search effort during index building.
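Example 2: Implementing IVF in Amazon Aurora PostgreSQL

For comparison, here is the IVF equivalent (pgvector calls it `ivfflat`; the parameter values are illustrative starting points, not recommendations). Unlike HNSW, an IVF index should be built *after* the table is populated, because it trains its centroids from the rows it sees at build time:

```sql
-- Step 1: Create an IVF index. `lists` is the number of clusters (centroids).
-- Build this AFTER loading data so the centroids reflect the real distribution.
CREATE INDEX ON product_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Step 2: At query time, choose how many clusters to scan (the "nprobe" knob).
SET ivfflat.probes = 10;
```

A common rule of thumb for `lists` is roughly rows/1000 for tables up to about a million rows; more lists mean smaller clusters but a coarser first-stage search.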
Checkpoint Questions
- Which index type generally requires more RAM (Memory)?
- If your dataset is constantly changing with new inserts, is HNSW or IVF generally better for maintaining high recall without frequent re-training?
- What is the main purpose of the "Centroids" in an IVF index?
Comparison Tables
| Feature | HNSW (Graph-based) | IVF (List-based) |
|---|---|---|
| Search Speed | Extremely Fast | Fast (depends on `nprobe`) |
| Memory Usage | High (stores graph edges) | Low (compressed lists) |
| Training Required | No | Yes (to find centroids) |
| Recall/Accuracy | Excellent | Good (can be tuned) |
| Best Use Case | Low latency, high RAM available | Massive datasets, cost-constrained |
Muddy Points & Cross-Refs
- HNSW Memory Consumption: Students often forget that HNSW requires the index to be loaded into memory for optimal performance. If your instance is memory-constrained, IVF is the safer choice.
- IVF `nprobe` Parameter: This is a common point of confusion. `nprobe` defines how many clusters to search. Increasing `nprobe` increases recall but also increases query time (latency).
- Cross-Ref: See Amazon Bedrock Knowledge Bases for how AWS automates the creation of these indexes for RAG (Retrieval-Augmented Generation) applications.
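In pgvector, both of the recall/latency knobs discussed above are plain session-level settings, so you can tune them per query without rebuilding the index (values below are illustrative):

```sql
-- IVF: scan more clusters => higher recall, higher latency (default is 1)
SET ivfflat.probes = 20;

-- HNSW: larger candidate list at query time => higher recall,
-- higher latency (default is 40)
SET hnsw.ef_search = 100;
```

A common pattern is to keep defaults for interactive queries and raise these settings for offline or batch workloads where accuracy matters more than latency.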
> [!TIP]
> For the AWS Exam: Remember that HNSW is usually the default recommendation for performance, while IVF is used for scaling to billion-scale datasets where memory is the bottleneck.