Curriculum Overview: Sources of ML Models and Customization Strategies
Describe sources of ML models (for example, open source pre-trained models, training custom models)
Welcome to the foundational curriculum overview for understanding and selecting Machine Learning (ML) model sources. This guide is tailored for AI practitioners preparing to navigate the ecosystem of open-source pre-trained models, proprietary foundation models (FMs), and custom model training, particularly within the AWS ecosystem.
Prerequisites
Before diving into the sources and training methodologies of ML models, learners must have a firm grasp of the following foundational concepts:
- Basic AI/ML Terminology: Familiarity with terms such as artificial intelligence, machine learning, deep learning, neural networks, and model fit (overfitting and underfitting).
- Learning Paradigms: Understanding the differences between supervised, unsupervised, and reinforcement learning.
- Cloud Computing Basics: General awareness of cloud infrastructure (e.g., compute resources like GPUs/TPUs, storage).
- The ML Lifecycle: High-level knowledge of the ML pipeline (data collection, preprocessing, training, evaluation, deployment).
[!IMPORTANT] If you are entirely new to Machine Learning, consider reviewing Unit 1: Fundamentals of AI and Machine Learning before proceeding with this curriculum.
Module Breakdown
This curriculum is structured to take you from consuming off-the-shelf models to architecting highly customized, custom-trained AI systems.
| Module | Title | Difficulty | Core Focus |
|---|---|---|---|
| 1 | The ML Model Landscape | Beginner | Differentiating between proprietary, open-source, and custom models. |
| 2 | Open-Source & Pre-Trained Models | Intermediate | Utilizing repositories like Hugging Face and Amazon SageMaker JumpStart. |
| 3 | Model Customization Techniques | Intermediate | Prompt engineering, In-Context Learning, and Retrieval-Augmented Generation (RAG). |
| 4 | Fine-Tuning & Custom Training | Advanced | Adapting model weights, continuous pre-training, and scaling laws. |
| 5 | Licensing & Governance | Beginner | Navigating Apache 2.0, MIT, and GNU GPL licenses for enterprise compliance. |
The Customization Continuum
Understanding the trade-offs between different model sources is critical. Compute effort scales with the level of customization, roughly in this order: prompt engineering < In-Context Learning < RAG < fine-tuning < pre-training from scratch.
Learning Objectives per Module
Module 1: The ML Model Landscape
- Identify the fundamental sources of ML models (e.g., open-source hubs, proprietary APIs, in-house development).
- Evaluate the trade-offs between building a model from scratch versus leveraging an existing Foundation Model (FM).
- Define how scaling laws relate model performance to parameter count, dataset size, and compute budget.
Module 2: Open-Source & Pre-Trained Models
- Navigate popular model repositories such as Hugging Face and Amazon SageMaker JumpStart.
- Deploy a pre-trained model for batch or real-time inference without modifying its underlying weights.
- Compare multi-modal models that process both textual and visual data.
Module 3: Model Customization Techniques
- Implement lightweight customization via In-Context Learning and Prompt Engineering.
- Integrate external knowledge bases using Retrieval-Augmented Generation (RAG) to reduce hallucinations.
- Estimate the expected compute cost of customization and training using standard approximations (e.g., training FLOPs ≈ 6 × parameter count × training tokens).
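The cost estimation above can be sketched with the widely used approximation that training compute is roughly 6 FLOPs per parameter per token. The GPU throughput and utilization figures below are illustrative assumptions (A100-class peak bf16 throughput, 40% sustained utilization), not fixed constants:

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens

def training_gpu_hours(params: float, tokens: float,
                       flops_per_gpu_sec: float = 312e12,  # assumed A100-class peak bf16
                       utilization: float = 0.4) -> float:  # assumed sustained utilization
    """Convert total FLOPs into GPU-hours under the assumed throughput."""
    seconds = training_flops(params, tokens) / (flops_per_gpu_sec * utilization)
    return seconds / 3600.0

# Example: a 7B-parameter model trained on 1T tokens
flops = training_flops(7e9, 1e12)      # ~4.2e22 FLOPs
hours = training_gpu_hours(7e9, 1e12)  # on the order of ~90,000 GPU-hours
```

Plugging real numbers into this back-of-the-envelope formula is usually the fastest way to see why pre-training from scratch sits at the expensive end of the customization continuum.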
Module 4: Fine-Tuning & Custom Training
- Differentiate between full fine-tuning, instruction tuning, and continuous pre-training.
- Prepare massive, high-quality, diverse datasets required for training an FM from scratch.
- Apply distributed training frameworks (e.g., PyTorch DDP, DeepSpeed ZeRO) to manage memory constraints across GPUs.
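The memory pressure that motivates frameworks like DeepSpeed ZeRO can be illustrated with the per-parameter accounting from the ZeRO paper (2 B fp16 weights + 2 B fp16 gradients + 12 B fp32 Adam states ≈ 16 bytes per parameter). The partitioning arithmetic below is a simplified sketch of the idea, not the framework's actual allocator:

```python
def zero_memory_per_gpu_gb(params: float, n_gpus: int, stage: int = 0) -> float:
    """Approximate per-GPU memory for model states under mixed-precision Adam.

    Per parameter: 2 B fp16 weights + 2 B fp16 gradients + 12 B fp32
    optimizer states (master weights, momentum, variance). Each ZeRO stage
    partitions progressively more of these states across the GPU group.
    """
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus      # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus      # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= n_gpus    # ZeRO-3: also shard the weights themselves
    return params * (weights + grads + optim) / 1e9

# A 7B-parameter model on 8 GPUs:
baseline = zero_memory_per_gpu_gb(7e9, 8, stage=0)  # ~112 GB: exceeds any single GPU
zero3    = zero_memory_per_gpu_gb(7e9, 8, stage=3)  # ~14 GB per GPU
```

Note this counts only model states; activations and framework overhead add more, which is why real deployments combine ZeRO with activation checkpointing.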
Module 5: Licensing & Governance
- Interpret End User License Agreements (EULAs) for ML models.
- Distinguish between permissive licenses (Apache 2.0, MIT) and copyleft licenses (GNU GPL).
- Implement source citation, data lineage tracking, and documentation using tools like Amazon SageMaker Model Cards.
Deep Dive: Open Source Licenses
When sourcing an open-source model, the license dictates how you can use it in production:
- Apache 2.0 / MIT: Highly permissive. You can modify, use, and distribute the model freely in commercial applications, provided you include copyright notices.
- GNU GPL: A "copyleft" license. If you modify the model and distribute your application, your derivative work must also be open-sourced under the same license.
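The permissive-versus-copyleft distinction above can be encoded as a simple lookup. The identifier lists here are illustrative only; any real compliance decision must rest on the model's actual license text or EULA:

```python
# Illustrative mapping only; real compliance requires reading the full license text.
PERMISSIVE = {"apache-2.0", "mit", "bsd-3-clause"}
COPYLEFT = {"gpl-2.0", "gpl-3.0", "agpl-3.0"}

def license_category(spdx_id: str) -> str:
    """Classify a license identifier as permissive, copyleft, or unknown."""
    license_key = spdx_id.strip().lower()
    if license_key in PERMISSIVE:
        return "permissive: commercial use OK with attribution"
    if license_key in COPYLEFT:
        return "copyleft: distributed derivatives must stay open source"
    return "unknown: review the full license text"

print(license_category("Apache-2.0"))
print(license_category("GPL-3.0"))
```

A check like this is a useful first-pass filter when triaging candidate models from a hub, before a legal review of the winners.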
Success Metrics
How will you know you have mastered this curriculum? You should be able to consistently demonstrate the following:
- Architectural Decision Making: Given a business scenario (e.g., a tight budget vs. highly specialized proprietary data), successfully defend a choice between RAG, fine-tuning, or pre-training.
- Platform Navigation: Successfully locate, deploy, and invoke a model from SageMaker JumpStart or Hugging Face within 30 minutes.
- Compliance Literacy: Accurately identify the commercial viability of 5 different open-source models based strictly on their provided EULA.
- Cost Estimation: Forecast the infrastructure costs associated with a batch inference pipeline versus a real-time endpoint.
Decision Framework Flowchart
Use this logic when assessing your mastery of model source selection: start with prompt engineering against an existing pre-trained model; add RAG when responses must be grounded in proprietary or frequently changing knowledge; move to fine-tuning when prompting alone cannot produce the required behavior; reserve pre-training from scratch for cases where no suitable FM exists and the budget supports it.
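That decision logic can be sketched as code. The questions and their ordering are illustrative assumptions for self-assessment, not an official framework:

```python
def choose_approach(prompting_suffices: bool,
                    needs_private_knowledge: bool,
                    needs_behavior_change: bool,
                    budget_supports_pretraining: bool) -> str:
    """Walk the customization continuum from cheapest option to most expensive."""
    if prompting_suffices:
        return "prompt engineering on a pre-trained model"
    if needs_private_knowledge and not needs_behavior_change:
        return "RAG over a pre-trained model"
    if needs_behavior_change:
        return "fine-tune an open-source FM"
    if budget_supports_pretraining:
        return "pre-train a custom model"
    return "re-scope the problem or revisit prompting"

# A chatbot that must answer from an internal wiki but needs no new skills:
print(choose_approach(False, True, False, False))
```

Working through a few business scenarios against a function like this is a quick way to rehearse the architectural decision-making metric above.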
Real-World Application
Why does understanding the sources of ML models matter in your career as an AI Practitioner?
- Cost Efficiency & ROI: Pre-training a model from scratch can cost millions of dollars in compute (GPUs/TPUs) and require terabytes of curated data. By leveraging a pre-trained open-source model from Hugging Face, an organization can dramatically reduce development costs and accelerate time-to-market.
- Legal Compliance: Using a GNU GPL-licensed model in a proprietary, closed-source enterprise application can trigger severe legal liabilities. Understanding license types ensures you protect your organization's intellectual property.
- Strategic Customization: A retail company wanting to generate product recommendations overnight might use Batch Inference with a fine-tuned open-source model to process large data volumes cost-effectively, rather than paying per-token for real-time API calls to a proprietary model.
[!TIP] Always start small. Before investing in fine-tuning or custom pre-training, test whether a robust prompt and a pre-trained model (like those available via Amazon Bedrock) can solve 80% of your business problem.
Glossary of Key Terms
- Foundation Model (FM): A massive AI model trained on a vast quantity of unlabeled data at scale, which can be adapted to a wide range of downstream tasks.
- Example: OpenAI's GPT-4 or Anthropic's Claude.
- Pre-training: The initial phase of training a model from scratch on massive datasets to learn underlying patterns, structures, and representations.
- Example: Training a new language model on the entire English Wikipedia.
- Fine-tuning: Taking a pre-trained model and training it further on a smaller, specialized dataset to adapt it for a specific task.
- Example: Taking a general medical text model and fine-tuning it specifically to recognize pediatric cardiology terminology.
- In-Context Learning: Adjusting the model's output via instructions within the prompt itself, without permanently altering the model's underlying weights.
- Example: Providing a large language model with three examples of well-formatted JSON before asking it to generate its own.
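The In-Context Learning example above amounts to plain prompt construction: the model's weights are never touched. A minimal sketch, using hypothetical demonstration records:

```python
def build_few_shot_prompt(examples: list[dict], query: str) -> str:
    """Prepend formatted demonstrations so the model infers the pattern in context."""
    lines = ["Convert each record to JSON.", ""]
    for ex in examples:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Output: {ex['output']}")
        lines.append("")  # blank line between demonstrations
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model completes from here
    return "\n".join(lines)

# Hypothetical demonstrations showing the desired JSON format
demos = [
    {"input": "Ada, 36", "output": '{"name": "Ada", "age": 36}'},
    {"input": "Grace, 45", "output": '{"name": "Grace", "age": 45}'},
]
prompt = build_few_shot_prompt(demos, "Alan, 41")
```

The resulting string would be sent as-is to any FM endpoint (e.g., via Amazon Bedrock); swapping the demonstrations changes the behavior without any training step.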