
Study Guide: Vectorization and Amazon Bedrock Knowledge Bases

Describe vectorization concepts (for example, Amazon Bedrock knowledge base)


This study guide covers the core concepts of vectorization, embedding models, and how AWS services like Amazon Bedrock facilitate Retrieval-Augmented Generation (RAG) through managed knowledge bases.

Learning Objectives

After studying this guide, you should be able to:

  • Define vectorization and its role in modern generative AI architectures.
  • Explain the workflow of an Amazon Bedrock Knowledge Base.
  • Compare vector index types, specifically HNSW and IVF.
  • Identify AWS-native options for storing and querying vector embeddings.

Key Terms & Glossary

  • Embedding: A numerical representation of data (text, images, audio) in a high-dimensional vector space where similar items are mathematically close to each other.
  • Vectorization: The process of converting unstructured data into embeddings using an embedding model.
  • RAG (Retrieval-Augmented Generation): A technique that retrieves relevant information from a private data source and provides it to an LLM as context to improve answer accuracy.
  • Chunking: The process of breaking down large documents into smaller, manageable segments before vectorization to preserve context and stay within model token limits.
  • Distance Metric: Mathematical formulas (e.g., Cosine Similarity, Euclidean Distance) used to measure the "closeness" of vectors during a search.

The "Big Idea"

In traditional databases, we search for exact matches (e.g., "Find customer_id 123"). In AI, we search for semantic meaning (e.g., "Find documents that discuss cloud security"). Vectorization is the bridge that turns human concepts into mathematical coordinates, allowing computers to "understand" similarity without needing keyword matches.

Formula / Concept Box

| Concept | Description | Mathematical/Logic Hint |
| --- | --- | --- |
| Cosine Similarity | Measures the angle between two vectors. | $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ (higher is more similar) |
| Euclidean Distance | Measures the straight-line distance between points. | $d = \sqrt{\sum_i (x_i - y_i)^2}$ (lower is more similar) |
| Dot Product | Measures magnitude and direction combined. | $A \cdot B = \sum_i a_i b_i$ |
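The three metrics above are simple to compute directly. A minimal pure-Python sketch (the toy 3-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import math

def dot(a, b):
    # A · B = sum of element-wise products
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # cos(theta) = (A · B) / (||A|| * ||B||); higher means more similar
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # d = sqrt(sum((x_i - y_i)^2)); lower means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.9, 0.1, 0.4]   # e.g., "cloud storage"
doc_a = [0.8, 0.2, 0.5]   # semantically close document
doc_b = [0.1, 0.9, -0.3]  # semantically distant document

print(cosine_similarity(query, doc_a))   # near 1.0 (close)
print(cosine_similarity(query, doc_b))   # near 0.0 (distant)
print(euclidean_distance(query, doc_a))  # small (close)
print(euclidean_distance(query, doc_b))  # large (distant)
```

Note how the two metrics agree on the ranking here but return values on different scales, which is why a vector store must be configured with one consistent distance metric.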

Hierarchical Outline

  1. The Vectorization Pipeline
    • Data Source: Storing raw data (e.g., PDFs in Amazon S3).
    • Chunking: Segmenting data into overlapping or fixed-size blocks.
    • Embedding Model: Passing chunks through models like Amazon Titan Text Embeddings.
  2. Amazon Bedrock Knowledge Bases
    • Ingestion Job: The automated workflow of syncing S3 to a Vector Store.
    • Managed Integration: Automatic handling of chunking, embedding, and storage.
  3. Vector Indexing & Storage
    • HNSW: Hierarchical Navigable Small World (graph-based, high speed/accuracy).
    • IVF: Inverted File Index (Cluster-based, lower memory footprint).
    • Supported Stores: Amazon OpenSearch Serverless, Amazon Aurora (pgvector), Pinecone.
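Step 1 of the pipeline (chunking) can be sketched as fixed-size splitting with overlap. The `chunk_size` and overlap values below are illustrative, not Bedrock defaults:

```python
def chunk_text(text, chunk_size=200, overlap_ratio=0.2):
    """Split text into fixed-size chunks with overlap, so a concept that
    straddles a boundary appears intact in at least one chunk."""
    step = int(chunk_size * (1 - overlap_ratio))  # advance 80% per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "word " * 100  # stand-in for a policy manual page (500 chars)
pieces = chunk_text(doc, chunk_size=100, overlap_ratio=0.2)

# Consecutive chunks share their boundary text (the 20% overlap).
assert pieces[0][-20:] == pieces[1][:20]
```

Each chunk would then be passed to the embedding model and stored alongside its vector; Bedrock Knowledge Bases automate this loop during an ingestion job.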

Visual Anchors

The RAG Workflow (Amazon Bedrock)

(Diagram: user query → embed query → similarity search in vector store → top-k chunks → augmented LLM prompt → grounded answer.)

Geometric Vector Space

```latex
\begin{tikzpicture}[scale=2]
  \draw[->] (0,0) -- (2,0) node[right] {Feature X};
  \draw[->] (0,0) -- (0,2) node[above] {Feature Y};
  \draw[fill=blue] (1.2, 1.5) circle (0.05) node[right] {Query (Cloud Storage)};
  \draw[fill=red] (1.0, 1.4) circle (0.05) node[left] {Doc A (S3 Bucket)};
  \draw[fill=red] (0.3, 0.4) circle (0.05) node[below] {Doc B (Pizza Recipe)};
  \draw[dashed] (1.2, 1.5) -- (1.0, 1.4) node[midway, above] {Close};
  \draw[dashed] (1.2, 1.5) -- (0.3, 0.4) node[midway, left] {Distant};
\end{tikzpicture}
```

Definition-Example Pairs

  • Term: Semantic Search

  • Definition: Searching based on the intent and contextual meaning of words rather than literal characters.

  • Example: Searching for "feline healthcare" and receiving results for "cat vet visits" even though the words do not match.

  • Term: Hierarchical Navigable Small World (HNSW)

  • Definition: An indexing algorithm that builds a multi-layered graph to quickly find the nearest neighbors in high-dimensional space.

  • Example: Like a highway system (top layer) connecting to local streets (bottom layer) to find a specific house address efficiently.
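The semantic search pair above can be made concrete with a toy example. The three-number "embeddings" here are hand-crafted for illustration (a real model such as Titan Text Embeddings would produce them, with far more dimensions):

```python
import math

# Hand-made toy embeddings; the dimensions loosely encode
# (animal-related, medical-related, food-related).
embeddings = {
    "cat vet visits": [0.9, 0.8, 0.1],
    "dog grooming":   [0.9, 0.2, 0.1],
    "pizza recipes":  [0.0, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# "feline healthcare" shares no keywords with any document,
# but its (toy) embedding sits close to "cat vet visits".
query = [0.8, 0.9, 0.0]
best = max(embeddings, key=lambda doc: cosine(query, embeddings[doc]))
print(best)  # -> "cat vet visits"
```

A keyword search would return nothing here; the embedding space is what lets "feline" find "cat".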

Worked Examples

Setting up a Knowledge Base for HR Documents

  1. Ingest: Upload 500 PDF policy manuals to an Amazon S3 bucket.
  2. Configure: In Amazon Bedrock, create a Knowledge Base. Select the S3 bucket as the source.
  3. Model Selection: Choose Titan Text Embeddings V2. By default, this model converts each chunk into a 1,024-dimensional vector (256- and 512-dimension outputs are also available).
  4. Storage: Select Amazon OpenSearch Serverless as the vector store.
  5. Test: Ask the Bedrock console: "What is the policy for parental leave?"
    • Internal Logic: Bedrock embeds your question, finds the top 3 relevant chunks in OpenSearch, and feeds them to Claude 3 to summarize the answer.
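The "Internal Logic" step can be mocked end to end without AWS. `fake_embed` below is a deterministic stand-in for the real embedding model, and the prompt assembly mirrors the grounding that Bedrock performs on your behalf:

```python
import hashlib
import math

def fake_embed(text, dims=32):
    # Toy stand-in for an embedding model: hash each word into a bucket.
    # A real system would call Titan Text Embeddings instead.
    vec = [0.0] * dims
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

chunks = [
    "Parental leave policy: employees receive 12 weeks of paid leave.",
    "Expense policy: submit receipts within 30 days.",
    "Parental leave eligibility begins after 90 days of employment.",
    "Office hours are 9am to 5pm on weekdays.",
]

question = "What is the policy for parental leave?"
q_vec = fake_embed(question)

# Retrieve the top-3 chunks by similarity, as the Knowledge Base would.
top3 = sorted(chunks, key=lambda c: cosine(q_vec, fake_embed(c)), reverse=True)[:3]

# The retrieved chunks become grounding context in the LLM prompt.
prompt = ("Answer using only this context:\n"
          + "\n".join(top3)
          + f"\n\nQuestion: {question}")
print(prompt)
```

In the managed service, the embedding, the vector-store query, and this prompt assembly all happen inside the `RetrieveAndGenerate` flow; you only supply the question.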

Checkpoint Questions

  1. Why is "Chunking" necessary before vectorization?
  2. What is the main difference between HNSW and IVF indexing?
  3. Which AWS service allows you to use SQL to perform vector searches?
  4. How does a Knowledge Base help reduce LLM hallucinations?

> [!TIP]
> **Answer Key:**
> 1. To fit model context windows and keep meanings focused.
> 2. HNSW uses graphs (fast/accurate); IVF uses clustering (memory-efficient).
> 3. Amazon Aurora (with pgvector).
> 4. By providing factual "grounding" data to the model.

Comparison Tables

| Feature | HNSW (Graph-based) | IVF (Cluster-based) |
| --- | --- | --- |
| Search Speed | Extremely fast | Fast (but slower than HNSW at scale) |
| Accuracy | High | Medium/High |
| Memory Usage | High (stores graph edges) | Low (stores centroids) |
| Best Use Case | When latency is critical | When memory/cost is limited |
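The IVF trade-off in the table can be seen in miniature: assign every vector to its nearest centroid's "inverted list", then search only the query's closest cluster instead of every vector. This is a sketch with hand-picked centroids, not a production index:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hand-picked centroids partition the vectors into inverted lists.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]]

# Build the inverted file: each vector joins its nearest centroid's list.
inverted_lists = {0: [], 1: []}
for v in vectors:
    nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
    inverted_lists[nearest].append(v)

def ivf_search(query):
    # Probe only the closest cluster. Fewer comparisons than brute force
    # (the memory/speed win), at some risk of missing a true neighbor
    # that landed in another cluster (the accuracy cost).
    cluster = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    return min(inverted_lists[cluster], key=lambda v: dist(query, v))

print(ivf_search([9.9, 10.0]))  # -> [9.8, 10.1]
```

HNSW instead keeps a layered graph over all vectors, which is why it costs more memory but rarely misses a neighbor.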

Muddy Points & Cross-Refs

  • Dimensions vs. Performance: Increasing the number of dimensions in an embedding (e.g., from 512 to 1536) often improves accuracy but increases storage costs and search latency.
  • Overlapping Chunks: When chunking, we often use a "20% overlap." This ensures that if a concept is split between two chunks, the context isn't lost.
  • Cross-Ref: For more on vector storage, see documentation on Amazon Aurora pgvector extension and Amazon OpenSearch Vector Engine.
