
AWS Data Engineer Associate: Vector Indexing (HNSW & IVF)

Describe vector index types (for example, HNSW, IVF)

Vector Indexing: HNSW and IVF for High-Dimensional Search

In the era of Generative AI and Large Language Models (LLMs), searching through massive datasets of unstructured data (text, images, audio) requires a new approach: Vector Search. This guide explores the two primary indexing types used in AWS services like Amazon Aurora (pgvector) and Amazon OpenSearch Service: HNSW and IVF.


Learning Objectives

By the end of this guide, you should be able to:

  • Explain the concept of Approximate Nearest Neighbor (ANN) search.
  • Describe the architecture and performance trade-offs of Hierarchical Navigable Small Worlds (HNSW).
  • Explain how Inverted File Indexes (IVF) partition vector space to optimize search.
  • Compare HNSW and IVF based on memory usage, query latency, and build time.
  • Identify AWS services that support these indexing algorithms (e.g., Aurora PostgreSQL with pgvector).

Key Terms & Glossary

  • Embedding: A numerical representation of data (like a word or image) in a high-dimensional vector space. Example: A 1536-dimensional vector representing the meaning of a sentence.
  • Approximate Nearest Neighbor (ANN): An algorithm category that finds "close enough" vectors quickly rather than calculating exact distances for every item in a database.
  • Centroid: The geometric center of a cluster of vectors in an IVF index.
  • Recall: The percentage of the true nearest neighbors found by an ANN search. High recall means high accuracy.
  • Dimensionality: The number of coordinates in a vector (e.g., 768 or 1536).

The "Big Idea"

Traditional databases use B-Trees to find exact matches or ranges of numbers. However, you cannot "sort" a vector. To find similar items, we must measure the distance between points in high-dimensional space. Since calculating distance against millions of vectors is too slow, we use Vector Indexes (HNSW/IVF) to create "shortcuts" that allow us to find similar items in milliseconds instead of seconds.
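To see why exact search does not scale, here is a minimal brute-force nearest-neighbor scan in Python (function names and data are illustrative). Every query costs one distance calculation per stored vector, which is exactly the linear work that HNSW and IVF indexes avoid:

```python
import math

def euclidean(q, v):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((qi - vi) ** 2 for qi, vi in zip(q, v)))

def exact_nearest(query, vectors):
    # Exact nearest neighbor: one distance computation per stored vector,
    # i.e. O(N * d) work per query. This linear scan is the baseline
    # that HNSW/IVF "shortcuts" are designed to beat.
    return min(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))

db = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(exact_nearest([0.9, 1.2], db))  # -> 1 (closest stored vector)
```

With millions of 1536-dimensional vectors, this scan takes seconds per query; ANN indexes trade a small amount of recall for millisecond latency.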


Formula / Concept Box

| Metric | Description | Formula / Logic |
| --- | --- | --- |
| Euclidean Distance ($L_2$) | Straight-line distance between two points. | $d = \sqrt{\sum_i (q_i - v_i)^2}$ |
| Cosine Similarity | Measures the angle between two vectors (ignores magnitude). | $\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$ |
| Inner Product (IP) | Measures both angle and magnitude. | $A \cdot B = \sum_i A_i B_i$ |
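The three metrics can be sketched in plain Python (helper names are illustrative, toy 2-D vectors stand in for real embeddings):

```python
import math

def euclidean(a, b):
    # L2 distance: d = sqrt(sum((a_i - b_i)^2))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (|A| |B|); ignores vector magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def inner_product(a, b):
    # A . B = sum(A_i * B_i); sensitive to both angle and magnitude.
    return sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0], [2.0, 0.0]
print(euclidean(a, b))          # 1.0
print(cosine_similarity(a, b))  # 1.0 (same direction, different length)
print(inner_product(a, b))      # 2.0 (length matters here)
```

Note how `a` and `b` point the same way: cosine similarity is maximal even though their Euclidean distance is nonzero, which is why the choice of metric (and the matching pgvector operator class) matters.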

Hierarchical Outline

  1. Vector Fundamentals
    • Embeddings translate context into numbers.
    • Vector databases store these numbers for semantic search.
  2. HNSW (Hierarchical Navigable Small Worlds)
    • Graph-based index.
    • Multi-layered approach for navigation.
    • High performance but high memory usage.
  3. IVF (Inverted File Index)
    • Partition-based index.
    • Groups vectors into clusters (Voronoi cells).
    • Uses a "centroid" to skip irrelevant clusters.
  4. AWS Implementation
    • Amazon Aurora (pgvector): Supports both HNSW and IVF.
    • Amazon OpenSearch: Native support for vector engines.
    • Amazon Bedrock: Uses vector indexes in Knowledge Bases.

Visual Anchors

HNSW Layered Architecture

(Diagram placeholder: sparse upper layers route a search down to the dense bottom layer.)

IVF Voronoi Partitioning

(Diagram placeholder: the vector space divided into Voronoi cells, each with a centroid.)

Definition-Example Pairs

  • HNSW (Hierarchical Navigable Small Worlds): A graph-based index that connects points to their nearest neighbors across multiple layers.
    • Example: Finding a specific house in a new city by first looking at a state map (Top Layer), then a city map (Middle), then a street map (Bottom).
  • IVF (Inverted File Index): A technique that divides the vector space into buckets (Voronoi cells) and only searches the most relevant buckets.
    • Example: To find a specific book, you go only to the "Science Fiction" section instead of checking every book in the library.
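The HNSW "map" analogy boils down to a greedy walk over a neighbor graph: from the current node, hop to whichever neighbor is closer to the query, and stop at a local minimum. Below is a minimal single-layer sketch in Python (the toy graph and function names are hypothetical; real HNSW maintains multiple layers and a candidate list, using coarse upper layers only to pick a good entry point for the bottom layer):

```python
import math

def greedy_search(graph, points, query, entry):
    # Greedy graph walk: move to the neighbor closest to the query;
    # stop when no neighbor improves on the current node.
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

# Toy example: five points on a line, each linked to its immediate neighbors.
points = [[0.0], [1.0], [2.0], [3.0], [4.0]]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, points, [3.2], entry=0))  # walks 0 -> 1 -> 2 -> 3
```

The walk visits only a handful of nodes instead of scanning all of them; HNSW's layered structure keeps those walks short even for millions of vectors.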

Worked Examples

Example 1: Implementing HNSW in Amazon Aurora PostgreSQL

To use vector search in Aurora, you must use the pgvector extension.

Step 1: Enable the extension

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

Step 2: Create a table with a vector column

```sql
CREATE TABLE product_embeddings (
    id bigserial PRIMARY KEY,
    embedding vector(1536)  -- 1536 is the dimension for OpenAI/Bedrock models
);
```

Step 3: Create an HNSW index

```sql
CREATE INDEX ON product_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

[!NOTE] m is the number of connections per node, and ef_construction determines the search effort during index building.


Checkpoint Questions

  1. Which index type generally requires more RAM (Memory)?
  2. If your dataset is constantly changing with new inserts, is HNSW or IVF generally better for maintaining high recall without frequent re-training?
  3. What is the main purpose of the "Centroids" in an IVF index?

Comparison Tables

| Feature | HNSW (Graph-based) | IVF (List-based) |
| --- | --- | --- |
| Search Speed | Extremely fast | Fast (depends on nprobe) |
| Memory Usage | High (stores graph edges) | Low (compressed lists) |
| Training Required | No | Yes (to find centroids) |
| Recall/Accuracy | Excellent | Good (can be tuned) |
| Best Use Case | Low latency, high RAM available | Massive datasets, cost-constrained |

Muddy Points & Cross-Refs

  • HNSW Memory Consumption: Students often forget that HNSW requires the index to be loaded into memory for optimal performance. If your instance is memory-constrained, IVF is the safer choice.
  • IVF nprobe Parameter: This is a common point of confusion. nprobe defines how many clusters to search. Increasing nprobe increases recall but also increases query time (latency).
  • Cross-Ref: See Amazon Bedrock Knowledge Bases for how AWS automates the creation of these indexes for RAG (Retrieval-Augmented Generation) applications.
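The nprobe trade-off can be demonstrated with a toy IVF sketch in Python (the hard-coded centroids and helper names are illustrative; a real IVF index learns its centroids via k-means during training):

```python
import math
import random

random.seed(0)

# Toy IVF: vectors pre-assigned to clusters around four fixed centroids.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
clusters = [[[c[0] + random.uniform(-1, 1), c[1] + random.uniform(-1, 1)]
             for _ in range(50)] for c in centroids]

def ivf_search(query, nprobe):
    # Rank centroids by distance to the query, then scan only the
    # nprobe closest clusters. Higher nprobe -> more vectors scanned
    # (higher latency) but a better chance of the true nearest neighbor.
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(centroids[i], query))
    candidates = [v for i in order[:nprobe] for v in clusters[i]]
    best = min(candidates, key=lambda v: math.dist(v, query))
    return best, len(candidates)

query = [5.0, 5.0]  # deliberately near the boundary of all four cells
for nprobe in (1, 2, 4):
    best, scanned = ivf_search(query, nprobe)
    print(nprobe, scanned, round(math.dist(best, query), 3))
```

Because the query sits near a cell boundary, nprobe = 1 can miss a closer vector that lives in a neighboring cluster; raising nprobe scans more lists and the best-found distance can only shrink.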

[!TIP] For the AWS Exam: Remember that HNSW is usually the default recommendation for performance, while IVF is used for scaling to billion-scale datasets where memory is the bottleneck.
