Study Guide

Elements of the Machine Learning Training Process

Elements in the training process (for example, epoch, steps, batch size)


This guide covers the core mechanics of training a machine learning model, specifically focusing on the levers (hyperparameters) that control the optimization process: epochs, steps, and batch sizes.

Learning Objectives

After studying this guide, you should be able to:

  • Distinguish between internal model parameters and external hyperparameters.
  • Define the relationship between epochs, batch size, and iterations (steps).
  • Analyze the trade-offs between small and large batch sizes on model convergence and memory.
  • Calculate the number of steps required for a training job given a dataset size.

Key Terms & Glossary

  • Hyperparameters: Settings configured before training begins (e.g., learning rate, batch size) that remain constant during the run.
  • Parameters: Internal variables (e.g., weights, biases) that the model learns and updates during training.
  • Epoch: One complete pass of the entire training dataset through the machine learning algorithm.
  • Batch Size: The number of training examples utilized in one iteration to update the model parameters.
  • Iteration (Step): A single update of the model's internal parameters using one batch of data.
  • Loss Function: A mathematical method to quantify the "error" between the model's prediction and the actual target.

The "Big Idea"

At its heart, machine learning training is an optimization problem. The goal is to minimize a loss function by adjusting internal weights. However, the model cannot decide for itself how it should learn; the developer must set the "pacing" and "granularity" of this learning process. These settings (Epochs and Batch Size) act as the metronome for the optimization algorithm, determining how frequently the model updates its knowledge and how many times it reviews the material.

Formula / Concept Box

| Relationship | Formula | Description |
| --- | --- | --- |
| Total Steps | $Steps = \frac{N}{B} \times Epochs$ | Total number of parameter updates in a job. |
| Steps per Epoch | $Steps\ per\ Epoch = \lceil \frac{N}{B} \rceil$ | How many updates occur before the model has seen all data once. |

Here $N$ is the total number of training samples and $B$ is the batch size.

> [!IMPORTANT]
> If your dataset has 1,000 samples and your batch size is 100, one epoch consists of 10 steps.
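The formulas above can be sketched as a small helper. This is a minimal illustration; the function name and signature are not from any particular framework. The ceiling accounts for a final partial batch when the dataset size is not evenly divisible by the batch size.

```python
import math

def training_steps(total_samples: int, batch_size: int, epochs: int):
    """Return (steps_per_epoch, total_steps) for a training job."""
    # One step per batch; a trailing partial batch still counts as a step.
    steps_per_epoch = math.ceil(total_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# The example from the box: 1,000 samples, batch size 100, one epoch
steps, total = training_steps(1000, 100, epochs=1)
print(steps, total)  # 10 10
```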

Hierarchical Outline

  1. Model Configuration
    • Parameters: Weights and Biases (learned internal data).
    • Hyperparameters: Knobs turned by the engineer (external config).
  2. The Pacing of Training
    • Batch Size: Controls the "granularity" of updates.
    • Epochs: Controls the "duration" of learning.
  3. Optimization Mechanics
    • Gradient Descent: The process of moving toward the minimum loss.
    • Stochastic Gradient Descent (SGD): Using small batches to estimate the gradient.
  4. Hardware Considerations
    • Memory Constraints: Large batches require more GPU/RAM.
    • Parallelization: Distributing batches across multiple compute nodes.
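The outline above can be made concrete with a toy SGD loop in plain Python. This is a sketch, not a framework API: the one-weight linear model, the data, and the learning rate are all illustrative assumptions chosen to show how epochs, batches, and steps nest.

```python
import random

def sgd_train(data, epochs: int, batch_size: int, lr: float = 0.01):
    """Fit a single weight w to minimize (w*x - y)^2 over (x, y) pairs."""
    data = list(data)  # avoid mutating the caller's list when shuffling
    w = 0.0
    step = 0
    for epoch in range(epochs):                    # one epoch = full pass over data
        random.shuffle(data)                       # reshuffle each epoch
        for i in range(0, len(data), batch_size):  # one iteration per batch
            batch = data[i:i + batch_size]
            # average gradient of the squared error over this batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                         # one parameter update = one step
            step += 1
    return w, step

data = [(x, 3.0 * x) for x in range(1, 9)]         # true weight is 3.0
w, steps = sgd_train(data, epochs=10, batch_size=4)
print(steps, round(w, 2))  # 20 total steps; w converges near the true weight 3.0
```

Note how the step count follows directly from the formula box: 8 samples / batch size 4 = 2 steps per epoch, times 10 epochs = 20 steps.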

Visual Anchors

  • Diagram: Training Flow Hierarchy
  • Diagram: Impact of Batch Size on Convergence

Definition-Example Pairs

  • Batch Size
    • Definition: The partition size of the dataset used for a single gradient update.
    • Example: Training on 32 images at a time from a library of 10,000 photos.
  • Epoch
    • Definition: A full cycle through the entire training dataset.
    • Example: Reading a textbook from cover to cover exactly once.
  • Learning Rate
    • Definition: The magnitude of the step taken toward the minimum loss during an update.
    • Example: A hiker taking large steps vs. tiny shuffles to reach the bottom of a valley.
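The hiker analogy for the learning rate can be demonstrated on the simplest possible "valley", $f(x) = x^2$. The starting point, step count, and learning rates below are illustrative assumptions, not recommended values.

```python
def descend(lr: float, start: float = 5.0, steps: int = 20) -> float:
    """Gradient descent on f(x) = x^2, whose minimum is at x = 0."""
    x = start
    for _ in range(steps):
        grad = 2 * x      # derivative of x^2
        x -= lr * grad    # stride length is controlled by the learning rate
    return x

big_strides = descend(lr=0.4)     # reaches the valley floor quickly
tiny_shuffles = descend(lr=0.01)  # still partway down after 20 steps
print(round(big_strides, 6), round(tiny_shuffles, 2))
```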

Worked Examples

Example 1: Calculating Training Steps

Problem: You are training a model on Amazon SageMaker with a dataset of 50,000 records. You set your batch_size to 125 and intend to run for 20 epochs. How many total steps (parameter updates) will the model perform?

Solution:

  1. Calculate steps per epoch: $50,000 / 125 = 400$ steps.
  2. Multiply by number of epochs: $400 \times 20 = 8,000$ total steps.

Example 2: Out of Memory (OOM) Error

Scenario: A data scientist is training a large language model. They set the batch size to 1024, but the training job fails immediately with an "Out of Memory" error.

Fix: Reduce the batch_size to 32 or 64. This reduces the number of samples held in GPU memory simultaneously, allowing the training to proceed, albeit with more steps required per epoch.
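A rough back-of-the-envelope check makes the failure mode visible. This sketch counts only the memory needed to hold one batch of activations; real usage also includes model weights, gradients, and optimizer state, and the 64 MB per-sample figure is an assumption for illustration only.

```python
def batch_memory_gb(batch_size: int, bytes_per_sample: int) -> float:
    """Lower bound on memory (GB) just to hold one batch in GPU memory."""
    return batch_size * bytes_per_sample / 1024**3

sample_bytes = 64 * 1024**2  # assume ~64 MB of activations per sample
print(batch_memory_gb(1024, sample_bytes))  # 64.0 GB -- exceeds most single GPUs
print(batch_memory_gb(32, sample_bytes))    # 2.0 GB  -- fits comfortably
```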

Checkpoint Questions

  1. What is the main difference between a parameter and a hyperparameter?
  2. If you increase your batch size, do you need more or less memory on your training instance?
  3. Why might training for too many epochs be a bad thing for model generalization?
  4. If a dataset has 500 samples and the batch size is 500, what is the relationship between one step and one epoch?
Answers:
  1. Parameters are learned during training (internal); hyperparameters are set before training (external).
  2. More memory.
  3. It leads to overfitting, where the model memorizes the training data rather than learning general patterns.
  4. One step is equal to one epoch.

Muddy Points & Cross-Refs

  • Steps vs. Iterations: These terms are used interchangeably in most AWS documentation and ML frameworks.
  • Underfitting vs. Overfitting:
    • Underfitting: Too few epochs (model hasn't learned the patterns yet).
    • Overfitting: Too many epochs (model has learned the "noise" of the data).
  • Validation Frequency: Usually, models are evaluated against the validation set at the end of every epoch, not every step.

Comparison Tables

Parameters vs. Hyperparameters

| Feature | Parameters | Hyperparameters |
| --- | --- | --- |
| Source | Learned from data | Set by the engineer |
| Time of setting | During training | Before training |
| Examples | Weights, biases, centroids | Learning rate, batch size, epochs |
| Purpose | Represent the model's logic | Control the learning process |

Batch Size Trade-offs

| Size | Speed per Step | Gradient Quality | Memory Usage |
| --- | --- | --- | --- |
| Small (e.g., 32) | Faster | Noisy (adds regularization) | Low |
| Large (e.g., 1024) | Slower | Smooth/accurate | High |
