
AWS Data Engineer Associate: Vector Indexing (HNSW & IVF)

Describe vector index types (for example, HNSW, IVF)

Vector Indexing: HNSW and IVF for High-Dimensional Search

In the era of Generative AI and Large Language Models (LLMs), searching through massive datasets of unstructured data (text, images, audio) requires a new approach: Vector Search. This guide explores the two primary indexing types used in AWS services like Amazon Aurora (pgvector) and Amazon OpenSearch Service: HNSW and IVF.


Learning Objectives

By the end of this guide, you should be able to:

  • Explain the concept of Approximate Nearest Neighbor (ANN) search.
  • Describe the architecture and performance trade-offs of Hierarchical Navigable Small Worlds (HNSW).
  • Explain how Inverted File Indexes (IVF) partition vector space to optimize search.
  • Compare HNSW and IVF based on memory usage, query latency, and build time.
  • Identify AWS services that support these indexing algorithms (e.g., Aurora PostgreSQL with pgvector).

Key Terms & Glossary

  • Embedding: A numerical representation of data (like a word or image) in a high-dimensional vector space. Example: A 1536-dimensional vector representing the meaning of a sentence.
  • Approximate Nearest Neighbor (ANN): An algorithm category that finds "close enough" vectors quickly rather than calculating exact distances for every item in a database.
  • Centroid: The geometric center of a cluster of vectors in an IVF index.
  • Recall: The percentage of the true nearest neighbors found by an ANN search. High recall means high accuracy.
  • Dimensionality: The number of coordinates in a vector (e.g., 768 or 1536).

The "Big Idea"

Traditional databases use B-Trees to find exact matches or ranges of numbers. However, you cannot "sort" a vector. To find similar items, we must measure the distance between points in high-dimensional space. Since calculating distance against millions of vectors is too slow, we use Vector Indexes (HNSW/IVF) to create "shortcuts" that allow us to find similar items in milliseconds instead of seconds.
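To see why exact search does not scale, here is a minimal brute-force nearest-neighbor scan in Python (function names and data are illustrative). Every query costs one distance calculation per stored vector, which is exactly the linear work that HNSW and IVF indexes avoid:

```python
import math

def euclidean(q, v):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((qi - vi) ** 2 for qi, vi in zip(q, v)))

def exact_nearest(query, vectors):
    # Exact nearest neighbor: one distance computation per stored vector,
    # i.e. O(N * d) work per query. This linear scan is the baseline
    # that HNSW/IVF "shortcuts" are designed to beat.
    return min(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))

db = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(exact_nearest([0.9, 1.2], db))  # -> 1 (closest stored vector)
```

With millions of 1536-dimensional vectors, this scan takes seconds per query; ANN indexes trade a small amount of recall for millisecond latency.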


Formula / Concept Box

| Metric | Description | Formula / Logic |
| --- | --- | --- |
| Euclidean Distance ($L_2$) | Straight-line distance between two points. | $d = \sqrt{\sum_i (q_i - v_i)^2}$ |
| Cosine Similarity | Measures the angle between two vectors (ignores magnitude). | $\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$ |
| Inner Product (IP) | Measures both angle and magnitude. | $A \cdot B = \sum_i A_i B_i$ |
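The three metrics can be sketched in plain Python (helper names are illustrative, toy 2-D vectors stand in for real embeddings):

```python
import math

def euclidean(a, b):
    # L2 distance: d = sqrt(sum((a_i - b_i)^2))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (|A| |B|); ignores vector magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def inner_product(a, b):
    # A . B = sum(A_i * B_i); sensitive to both angle and magnitude.
    return sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0], [2.0, 0.0]
print(euclidean(a, b))          # 1.0
print(cosine_similarity(a, b))  # 1.0 (same direction, different length)
print(inner_product(a, b))      # 2.0 (length matters here)
```

Note how `a` and `b` point the same way: cosine similarity is maximal even though their Euclidean distance is nonzero, which is why the choice of metric (and the matching pgvector operator class) matters.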

Hierarchical Outline

  1. Vector Fundamentals
    • Embeddings translate context into numbers.
    • Vector databases store these numbers for semantic search.
  2. HNSW (Hierarchical Navigable Small Worlds)
    • Graph-based index.
    • Multi-layered approach for navigation.
    • High performance but high memory usage.
  3. IVF (Inverted File Index)
    • Partition-based index.
    • Groups vectors into clusters (Voronoi cells).
    • Uses a "centroid" to skip irrelevant clusters.
  4. AWS Implementation
    • Amazon Aurora (pgvector): Supports both HNSW and IVF.
    • Amazon OpenSearch: Native support for vector engines.
    • Amazon Bedrock: Uses vector indexes in Knowledge Bases.

Visual Anchors

HNSW Layered Architecture

(Diagram placeholder: sparse upper layers route a search down to the dense bottom layer.)

IVF Voronoi Partitioning

(Diagram placeholder: the vector space divided into Voronoi cells, each with a centroid.)

Definition-Example Pairs

  • HNSW (Hierarchical Navigable Small Worlds): A graph-based index that connects points to their nearest neighbors across multiple layers.
    • Example: Finding a specific house in a new city by first looking at a state map (Top Layer), then a city map (Middle), then a street map (Bottom).
  • IVF (Inverted File Index): A technique that divides the vector space into buckets (Voronoi cells) and only searches the most relevant buckets.
    • Example: To find a specific book, you go only to the "Science Fiction" section instead of checking every book in the library.
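The HNSW "map" analogy boils down to a greedy walk over a neighbor graph: from the current node, hop to whichever neighbor is closer to the query, and stop at a local minimum. Below is a minimal single-layer sketch in Python (the toy graph and function names are hypothetical; real HNSW maintains multiple layers and a candidate list, using coarse upper layers only to pick a good entry point for the bottom layer):

```python
import math

def greedy_search(graph, points, query, entry):
    # Greedy graph walk: move to the neighbor closest to the query;
    # stop when no neighbor improves on the current node.
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

# Toy example: five points on a line, each linked to its immediate neighbors.
points = [[0.0], [1.0], [2.0], [3.0], [4.0]]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, points, [3.2], entry=0))  # walks 0 -> 1 -> 2 -> 3
```

The walk visits only a handful of nodes instead of scanning all of them; HNSW's layered structure keeps those walks short even for millions of vectors.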

Worked Examples

Example 1: Implementing HNSW in Amazon Aurora PostgreSQL

To use vector search in Aurora, you must use the pgvector extension.

Step 1: Enable the extension

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

Step 2: Create a table with a vector column

```sql
CREATE TABLE product_embeddings (
    id bigserial PRIMARY KEY,
    embedding vector(1536)  -- 1536 is the dimension for OpenAI/Bedrock models
);
```

Step 3: Create an HNSW index

```sql
CREATE INDEX ON product_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

[!NOTE] m is the number of connections per node, and ef_construction determines the search effort during index building.


Checkpoint Questions

  1. Which index type generally requires more RAM (Memory)?
  2. If your dataset is constantly changing with new inserts, is HNSW or IVF generally better for maintaining high recall without frequent re-training?
  3. What is the main purpose of the "Centroids" in an IVF index?

Comparison Tables

| Feature | HNSW (Graph-based) | IVF (List-based) |
| --- | --- | --- |
| Search Speed | Extremely fast | Fast (depends on nprobe) |
| Memory Usage | High (stores graph edges) | Low (compressed lists) |
| Training Required | No | Yes (to find centroids) |
| Recall/Accuracy | Excellent | Good (can be tuned) |
| Best Use Case | Low latency, high RAM available | Massive datasets, cost-constrained |

Muddy Points & Cross-Refs

  • HNSW Memory Consumption: Students often forget that HNSW requires the index to be loaded into memory for optimal performance. If your instance is memory-constrained, IVF is the safer choice.
  • IVF nprobe Parameter: This is a common point of confusion. nprobe defines how many clusters to search. Increasing nprobe increases recall but also increases query time (latency).
  • Cross-Ref: See Amazon Bedrock Knowledge Bases for how AWS automates the creation of these indexes for RAG (Retrieval-Augmented Generation) applications.
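The nprobe trade-off can be demonstrated with a toy IVF sketch in Python (the hard-coded centroids and helper names are illustrative; a real IVF index learns its centroids via k-means during training):

```python
import math
import random

random.seed(0)

# Toy IVF: vectors pre-assigned to clusters around four fixed centroids.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
clusters = [[[c[0] + random.uniform(-1, 1), c[1] + random.uniform(-1, 1)]
             for _ in range(50)] for c in centroids]

def ivf_search(query, nprobe):
    # Rank centroids by distance to the query, then scan only the
    # nprobe closest clusters. Higher nprobe -> more vectors scanned
    # (higher latency) but a better chance of the true nearest neighbor.
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(centroids[i], query))
    candidates = [v for i in order[:nprobe] for v in clusters[i]]
    best = min(candidates, key=lambda v: math.dist(v, query))
    return best, len(candidates)

query = [5.0, 5.0]  # deliberately near the boundary of all four cells
for nprobe in (1, 2, 4):
    best, scanned = ivf_search(query, nprobe)
    print(nprobe, scanned, round(math.dist(best, query), 3))
```

Because the query sits near a cell boundary, nprobe = 1 can miss a closer vector that lives in a neighboring cluster; raising nprobe scans more lists and the best-found distance can only shrink.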

[!TIP] For the AWS Exam: Remember that HNSW is usually the default recommendation for performance, while IVF is used for scaling to billion-scale datasets where memory is the bottleneck.
