BrainyBeeBrainyBee
ExploreBlogStart Studying
HomeAWS Certified Data Engineer - Associate (DEA-C01)AWS Data Engineer Associate: Vector Indexing (HNSW & IVF)
Study Guide890 words

AWS Data Engineer Associate: Vector Indexing (HNSW & IVF)

Describe vector index types (for example, HNSW, IVF)

Vector Indexing: HNSW and IVF for High-Dimensional Search

In the era of Generative AI and Large Language Models (LLMs), searching through massive datasets of unstructured data (text, images, audio) requires a new approach: Vector Search. This guide explores the two primary indexing types used in AWS services like Amazon Aurora (pgvector) and Amazon OpenSearch: HNSW and IVF.


Learning Objectives

By the end of this guide, you should be able to:

  • Explain the concept of Approximate Nearest Neighbor (ANN) search.
  • Describe the architecture and performance trade-offs of Hierarchical Navigable Small Worlds (HNSW).
  • Explain how Inverted File Indexes (IVF) partition vector space to optimize search.
  • Compare HNSW and IVF based on memory usage, query latency, and build time.
  • Identify AWS services that support these indexing algorithms (e.g., Aurora PostgreSQL with pgvector).

Key Terms & Glossary

  • Embedding: A numerical representation of data (like a word or image) in a high-dimensional vector space. Example: A 1536-dimensional vector representing the meaning of a sentence.
  • Approximate Nearest Neighbor (ANN): An algorithm category that finds "close enough" vectors quickly rather than calculating exact distances for every item in a database.
  • Centroid: The geometric center of a cluster of vectors in an IVF index.
  • Recall: The percentage of the true nearest neighbors found by an ANN search. High recall means high accuracy.
  • Dimensionality: The number of coordinates in a vector (e.g., 768 or 1536).

The "Big Idea"

Traditional databases use B-Trees to find exact matches or ranges of numbers. However, you cannot "sort" a vector. To find similar items, we must measure the distance between points in high-dimensional space. Since calculating distance against millions of vectors is too slow, we use Vector Indexes (HNSW/IVF) to create "shortcuts" that allow us to find similar items in milliseconds instead of seconds.


Formula / Concept Box

MetricDescriptionFormula / Logic
Euclidean Distance ($L_2)Straight-line distance between two points.d = \sqrt{\sum (q_i - v_i)^2}$
Cosine SimilarityMeasures the angle between two vectors (ignores magnitude).cos⁡(θ)=A⋅B∥A∥∥B∥\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}cos(θ)=∥A∥∥B∥A⋅B​
Inner Product ($IP)Measures both angle and magnitude.A \cdot B = \sum A_i B_i$

Hierarchical Outline

  1. Vector Fundamentals
    • Embeddings translate context into numbers.
    • Vector databases store these numbers for semantic search.
  2. HNSW (Hierarchical Navigable Small Worlds)
    • Graph-based index.
    • Multi-layered approach for navigation.
    • High performance but high memory usage.
  3. IVF (Inverted File Index)
    • Partition-based index.
    • Groups vectors into clusters (Voronoi cells).
    • Uses a "centroid" to skip irrelevant clusters.
  4. AWS Implementation
    • Amazon Aurora (pgvector): Supports both HNSW and IVF.
    • Amazon OpenSearch: Native support for vector engines.
    • Amazon Bedrock: Uses vector indexes in Knowledge Bases.

Visual Anchors

HNSW Layered Architecture

Loading Diagram...

IVF Voronoi Partitioning

Compiling TikZ diagram…
⏳
Running TeX engine…
This may take a few seconds

Definition-Example Pairs

  • HNSW (Hierarchical Navigable Small Worlds): A graph-based index that connects points to their nearest neighbors across multiple layers.
    • Example: Finding a specific house in a new city by first looking at a state map (Top Layer), then a city map (Middle), then a street map (Bottom).
  • IVF (Inverted File Index): A technique that divides the vector space into buckets (Voronoi cells) and only searches the most relevant buckets.
    • Example: To find a specific book, you go only to the "Science Fiction" section instead of checking every book in the library.

Worked Examples

Example 1: Implementing HNSW in Amazon Aurora PostgreSQL

To use vector search in Aurora, you must use the pgvector extension.

Step 1: Enable the extension

sql
CREATE EXTENSION IF NOT EXISTS vector;

Step 2: Create a table with a vector column

sql
CREATE TABLE product_embeddings ( id bigserial PRIMARY KEY, embedding vector(1536) -- 1536 is the dimension for OpenAI/Bedrock models );

Step 3: Create an HNSW index

sql
CREATE INDEX ON product_embeddings USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

[!NOTE] m is the number of connections per node, and ef_construction determines the search effort during index building.


Checkpoint Questions

  1. Which index type generally requires more RAM (Memory)?
  2. If your dataset is constantly changing with new inserts, is HNSW or IVF generally better for maintaining high recall without frequent re-training?
  3. What is the main purpose of the "Centroids" in an IVF index?

Comparison Tables

FeatureHNSW (Graph-based)IVF (List-based)
Search SpeedExtremely FastFast (depends on nprobe)
Memory UsageHigh (stores graph edges)Low (compressed lists)
Training RequiredNoYes (to find centroids)
Recall/AccuracyExcellentGood (can be tuned)
Best Use CaseLow latency, high RAM availableMassive datasets, cost-constrained

Muddy Points & Cross-Refs

  • HNSW Memory Consumption: Students often forget that HNSW requires the index to be loaded into memory for optimal performance. If your instance is memory-constrained, IVF is the safer choice.
  • IVF nprobe Parameter: This is a common point of confusion. nprobe defines how many clusters to search. Increasing nprobe increases recall but also increases query time (latency).
  • Cross-Ref: See Amazon Bedrock Knowledge Bases for how AWS automates the creation of these indexes for RAG (Retrieval-Augmented Generation) applications.

[!TIP] For the AWS Exam: Remember that HNSW is usually the default recommendation for performance, while IVF is used for scaling to billion-scale datasets where memory is the bottleneck.

All AWS Certified Data Engineer - Associate (DEA-C01) Study Resources

Related Notes

  • AWS Data Engineering: Addressing Changes to Data Characteristics945 words
  • Analyzing Logs with AWS Services: A Study Guide945 words
  • Mastering Log Analysis with AWS Services: DEA-C01 Study Guide925 words
  • AWS Authorization Methods: RBAC, ABAC, and TBAC1,152 words
  • Applying IAM Policies to Roles, Endpoints, and Services1,150 words
  • AWS Storage Services: Purpose-Built Data Stores and Vector Indexing940 words
  • Curriculum Overview: AWS Audit Logs and Governance for Data Engineers875 words
  • Hands-On Lab: Implementing and Analyzing Audit Logs in AWS850 words
  • Curriculum Overview: Authentication Mechanisms for AWS Data Engineering845 words
  • Lab: Implementing Secure Authentication with IAM Roles and Secrets Manager945 words
  • Curriculum Overview: AWS Authorization Mechanisms for Data Engineers785 words
  • Lab: Implementing Least-Privilege Authorization with IAM Roles and Policies850 words

Ready to study AWS Certified Data Engineer - Associate (DEA-C01)?

Practice tests, flashcards, and all study notes — free, no sign-up.

Start Studying

Ready to study AWS Certified Data Engineer - Associate (DEA-C01)?

Practice tests, flashcards, and all study notes — free, no sign-up needed.

Start Studying — Free
AWS Certified Data Engineer - Associate (DEA-C01) ResourcesExplore All HivesBlogHome

© 2026 BrainyBee. Free AI-powered exam prep.