AWS Study Guide: Choosing Built-in Algorithms and Foundation Models
Choosing built-in algorithms, foundation models, and solution templates (for example, in SageMaker JumpStart and Amazon Bedrock)
This guide covers the strategic selection of machine learning components within the AWS ecosystem, specifically focusing on the trade-offs between managed AI services, Amazon Bedrock, and SageMaker JumpStart.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between AWS AI Services, Amazon Bedrock, and SageMaker built-in algorithms.
- Select the appropriate foundation model or algorithm based on business constraints like cost, time, and interpretability.
- Utilize SageMaker JumpStart to deploy prebuilt solution templates and fine-tune models.
- Evaluate the trade-offs between model complexity and interpretability.
Key Terms & Glossary
- Foundation Model (FM): A large-scale model trained on a vast dataset that can be adapted to various downstream tasks (e.g., text generation, summarization).
- SageMaker JumpStart: A hub within SageMaker that provides one-click access to hundreds of pre-trained models and end-to-end solution templates.
- Amazon Bedrock: A fully managed service that offers a choice of high-performing foundation models via an API.
- Interpretability: The degree to which a human can understand the cause of a decision made by an ML model.
- Linear Learner: A SageMaker built-in algorithm used specifically for binary classification or regression.
- XGBoost: An optimized gradient-boosted tree algorithm known for high accuracy in structured data problems.
The "Big Idea"
In the AWS ecosystem, the goal is acceleration. You should always prefer the highest level of abstraction that meets your needs. Start with AI Services (ready-to-use) or Amazon Bedrock (API-based FMs). If you need more customization or specific data science control, move to SageMaker JumpStart. Only move to SageMaker Built-in Algorithms or custom code when you need to optimize for specific performance metrics, scale, or cost structures not met by higher-level services.
Formula / Concept Box
| Selection Metric | AI Services / Bedrock | SageMaker JumpStart | SageMaker Built-ins |
|---|---|---|---|
| ML Expertise | Low | Medium | High |
| Customization | Minimal (prompting; some Bedrock models support fine-tuning) | High (Full access to notebooks) | Very High (Hyperparameter Tuning) |
| Deployment | Serverless / API | Managed Endpoints | Managed Endpoints |
| Example | Amazon Translate | Llama-3 on JumpStart | SageMaker XGBoost |
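The "prefer the highest level of abstraction" rule from the Big Idea and the concept box can be sketched as a small decision helper. This is an illustrative heuristic, not an official AWS decision procedure; the inputs and branch order are assumptions made for teaching purposes.

```python
# Toy decision helper encoding the "prefer the highest abstraction" rule.
# The inputs and thresholds are illustrative assumptions, not an official
# AWS selection procedure.

def choose_service(needs_gen_ai: bool, needs_custom_training: bool,
                   needs_algorithm_control: bool) -> str:
    """Return the highest-abstraction option that still fits the need."""
    if not (needs_custom_training or needs_algorithm_control):
        # Zero-management options first: AI Services, or Bedrock via API.
        return "AI Services / Amazon Bedrock" if needs_gen_ai else "AI Services"
    if not needs_algorithm_control:
        # Pre-trained models plus notebooks, with fine-tuning on your data.
        return "SageMaker JumpStart"
    # Full control over algorithm choice and hyperparameter tuning.
    return "SageMaker Built-in Algorithms"

print(choose_service(True, False, False))   # generative AI, no training needs
print(choose_service(False, True, False))   # custom training on your data
print(choose_service(False, True, True))    # full algorithm-level control
```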
Hierarchical Outline
- AWS Artificial Intelligence (AI) Services
- Managed Services: Rekognition (image/video analysis), Transcribe (speech-to-text), Lex (chatbots).
- Use Case: Solving specific business problems with zero ML model management.
- Amazon Bedrock
- Generative AI Focus: Access to models like Amazon Nova, Claude, and Llama via API.
- Capabilities: Text generation, summarization, and question-answering.
- Amazon SageMaker JumpStart
- Pre-trained Models: Access to hundreds of models from popular hubs.
- Solution Templates: Prebuilt workflows for fraud detection, forecasting, etc.
- Fine-tuning: Ability to use custom datasets on pre-trained FMs.
- SageMaker Built-in Algorithms
- Supervised: Linear Learner, XGBoost, k-NN.
- Unsupervised: K-Means, PCA, Random Cut Forest (Anomaly Detection).
- Optimization: Highly optimized for AWS infrastructure (speed/scale).
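To make the supervised built-ins concrete, here is a toy nearest-neighbour classifier showing the core idea behind the k-NN algorithm: label a point by the labels of its closest training examples. The data and helper names are illustrative; this is not the SageMaker implementation, which is optimized for large-scale indexed search.

```python
# Toy k-nearest-neighbour classifier illustrating the idea behind the
# SageMaker k-NN built-in: label a point by its nearest training examples.
# Pure-Python sketch for teaching, not the AWS implementation.
import math

def knn_predict(train, point, k=1):
    """train: list of ((x, y), label); returns majority label of k nearest."""
    by_dist = sorted(train, key=lambda t: math.dist(t[0], point))
    labels = [label for _, label in by_dist[:k]]
    return max(set(labels), key=labels.count)

train = [((0.0, 0.0), "low"), ((0.1, 0.2), "low"), ((5.0, 5.0), "high")]
print(knn_predict(train, (0.2, 0.1)))  # closest neighbours are "low" points
```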
Visual Anchors
Model Selection Flowchart
The Trade-off Triangle
Definition-Example Pairs
- Prebuilt Solution Templates: End-to-end CloudFormation-based deployments that provision the resources needed for a specific use case.
- Example: A "Demand Forecasting" template that deploys an S3 bucket, a SageMaker notebook, and an endpoint simultaneously.
- Script Mode: Training models using custom Python scripts (TensorFlow/PyTorch) while still using SageMaker infrastructure.
- Example: Using a specific version of PyTorch not available in built-ins to perform custom deep learning.
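A script-mode entry point follows a fixed contract: SageMaker passes hyperparameters as command-line arguments and injects data/model paths through environment variables such as SM_MODEL_DIR and SM_CHANNEL_TRAINING. The skeleton below shows that contract only; the training body, hyperparameter names, and local defaults are illustrative.

```python
# Minimal skeleton of a SageMaker script-mode entry point. SageMaker passes
# hyperparameters as CLI flags and injects paths via environment variables
# (SM_MODEL_DIR, SM_CHANNEL_TRAINING). Training logic itself is omitted.
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as CLI flags (set on the Estimator).
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # SageMaker-provided locations (with local fallbacks for testing).
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAINING",
                                               "/opt/ml/input/data/training"))
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    # ... load data from args.train, train, save artifacts to args.model_dir ...
    print(f"would train for {args.epochs} epochs at lr={args.learning_rate}")
```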
Worked Examples
Example 1: Selecting a Foundation Model
Scenario: A company needs to build a legal document summarizer. They have no data scientists but plenty of web developers.
Selection: Amazon Bedrock.
Reasoning: Bedrock provides a serverless API. Developers can send the document text to a model (such as Claude) and receive a summary without managing any underlying infrastructure or training clusters.
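A request for this scenario can be sketched with the shape used by Bedrock's Converse API. The model ID and prompt are examples; the actual call (commented out) requires boto3, AWS credentials, and model access enabled in your Bedrock account.

```python
# Sketch of a Bedrock summarization request in the Converse API shape.
# The model ID is an example; the commented boto3 call needs AWS credentials
# and Bedrock model access in your account.
document_text = "WHEREAS the parties agree to the following terms..."

request = {
    "modelId": "anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    "messages": [
        {
            "role": "user",
            "content": [{"text": f"Summarize this contract:\n\n{document_text}"}],
        }
    ],
}

# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**request)
# summary = response["output"]["message"]["content"][0]["text"]
print(request["messages"][0]["role"])
```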
Example 2: Structured Data Classification
Scenario: A bank wants to predict credit card churn based on customer transaction history (CSV data).
Selection: SageMaker XGBoost.
Reasoning: This is a classic supervised learning problem on structured data. XGBoost is a built-in algorithm optimized for this specific task and offers better performance/cost than a Foundation Model.
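A configuration for this churn problem might look like the sketch below. The hyperparameter values are illustrative starting points, and the commented Estimator call assumes the sagemaker SDK, an IAM role, and training data already staged in S3 (the bucket path is a placeholder).

```python
# Hyperparameter sketch for the SageMaker XGBoost built-in on the churn
# scenario above. Values are illustrative starting points; the commented
# Estimator call requires the sagemaker SDK, an IAM role, and S3 data.
hyperparameters = {
    "objective": "binary:logistic",  # churn / no-churn classification
    "eval_metric": "auc",
    "num_round": 100,
    "max_depth": 5,
    "eta": 0.2,
}

# from sagemaker import image_uris
# from sagemaker.estimator import Estimator
# container = image_uris.retrieve("xgboost", region="us-east-1", version="1.7-1")
# estimator = Estimator(container, role=role, instance_count=1,
#                       instance_type="ml.m5.xlarge",
#                       hyperparameters=hyperparameters)
# estimator.fit({"train": "s3://my-bucket/churn/train/"})  # placeholder path
print(hyperparameters["objective"])
```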
Checkpoint Questions
- What service should you use if you want to deploy a pre-trained Llama-3 model into your own VPC for privacy?
- Which built-in algorithm is best suited for finding vector representations of objects?
- True/False: Amazon Bedrock requires the user to manage the underlying EC2 instances for the models.
- When would you choose SageMaker Built-in Algorithms over Amazon Bedrock?
Muddy Points & Cross-Refs
- JumpStart vs. Bedrock: This is the most common confusion. Bedrock is serverless and API-driven. JumpStart gives you the "ingredients" (notebooks, models) to bake inside SageMaker. Choose JumpStart if you need to deeply customize the training code or infrastructure.
- Cost Considerations: AI services are usually pay-per-request. SageMaker endpoints are usually pay-per-hour for the instance. For low-volume tasks, Bedrock/AI Services are cheaper; for high-volume, dedicated SageMaker endpoints might be more cost-effective.
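The cost trade-off above can be checked with back-of-the-envelope arithmetic: find the request volume at which an always-on endpoint becomes cheaper than pay-per-request pricing. All prices below are made-up placeholders; substitute current AWS pricing for a real comparison.

```python
# Break-even between pay-per-request pricing and a dedicated always-on
# endpoint. Prices are hypothetical placeholders, not real AWS rates.
price_per_request = 0.002        # hypothetical $/request (API-based service)
endpoint_price_per_hour = 0.23   # hypothetical $/hour (dedicated instance)

monthly_endpoint_cost = endpoint_price_per_hour * 24 * 30
break_even_requests = monthly_endpoint_cost / price_per_request

print(f"Endpoint costs ${monthly_endpoint_cost:.2f}/month; a dedicated "
      f"endpoint wins above ~{break_even_requests:,.0f} requests/month")
```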
Comparison Tables
Supervised vs. Unsupervised Built-ins
| Algorithm | Type | Primary Use Case |
|---|---|---|
| Linear Learner | Supervised | Binary/Multiclass Classification; Regression |
| XGBoost | Supervised | High-performance tabular data ranking/classification |
| k-NN | Supervised | Classification based on nearest data points |
| K-Means | Unsupervised | Grouping similar customers into segments |
| PCA | Unsupervised | Reducing the number of features (Dimensionality Reduction) |
| Random Cut Forest | Unsupervised | Detecting anomalies in time-series data |