Curriculum Overview: Data and Compute Services in Azure Machine Learning
Describe data and compute services for data science and machine learning
This curriculum explains how data and compute resources are managed in Microsoft Azure to support data science and machine learning (ML) workloads. It focuses on the Azure Machine Learning (AML) service as the central hub for the ML lifecycle.
Prerequisites
Before engaging with this curriculum, learners should have a foundational understanding of the following:
- Cloud Fundamentals: Basic knowledge of cloud computing concepts (SaaS, PaaS, IaaS) and Azure subscription management.
- Machine Learning Basics: Familiarity with the differences between regression, classification, and clustering.
- Data Concepts: Understanding of basic dataset structures (features and labels) and the purpose of training versus validation sets.
- Azure AI Workloads: General awareness of common AI scenarios like Computer Vision and Natural Language Processing (Unit 1 concepts).
Module Breakdown
| Module | Topic | Difficulty | Focus Area |
|---|---|---|---|
| 1 | The Azure ML Workspace | Beginner | Studio Interface, Resources, and Security |
| 2 | Data Assets & Storage | Intermediate | Datastores, Data Assets, and Data Exploration |
| 3 | Compute Resources | Intermediate | Compute Instances, Compute Clusters, and Inference Clusters |
| 4 | No-Code Training Tools | Intermediate | AutoML and Azure Machine Learning Designer |
| 5 | Model Management | Advanced | Registration, Versioning, and Deployment |
Learning Objectives per Module
Module 1: The Azure ML Workspace
- Describe the role of Azure Machine Learning Studio as a unified platform.
- Identify how to create and manage a workspace for team collaboration.
Module 2: Data Assets & Storage
- Distinguish between Datastores (connections to storage services) and Data Assets (versioned references to data).
- Explain how to import and explore data directly within the Studio environment.
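The distinction between a datastore and a data asset can be sketched in plain Python. This is a conceptual model only, not the Azure ML SDK: the class names `Datastore`, `DataAsset`, and `AssetRegistry`, the storage URI, and the file paths are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastore:
    """A named connection to a storage service (no data is copied)."""
    name: str
    storage_uri: str  # hypothetical URI, for illustration only


@dataclass(frozen=True)
class DataAsset:
    """A named, versioned reference to a path inside a datastore."""
    name: str
    version: int
    datastore: Datastore
    relative_path: str

    @property
    def uri(self) -> str:
        # The asset stores only a pointer; the data stays in storage.
        return f"{self.datastore.storage_uri}/{self.relative_path}"


class AssetRegistry:
    """Keeps every registered version so older ones stay retrievable."""

    def __init__(self) -> None:
        self._assets = {}  # maps (name, version) -> DataAsset

    def register(self, name: str, datastore: Datastore, relative_path: str) -> DataAsset:
        # New versions are appended; existing versions are never overwritten.
        latest = max((v for (n, v) in self._assets if n == name), default=0)
        asset = DataAsset(name, latest + 1, datastore, relative_path)
        self._assets[(name, asset.version)] = asset
        return asset

    def get(self, name: str, version: int) -> DataAsset:
        return self._assets[(name, version)]


store = Datastore("training_blob", "https://example.invalid/container")
registry = AssetRegistry()
v1 = registry.register("sales", store, "2023/sales.csv")
v2 = registry.register("sales", store, "2025/sales.csv")
```

Registering a second version leaves the first untouched, so `registry.get("sales", 1)` still resolves to the original file. That is the property that makes a training run reproducible later.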
Module 3: Compute Resources
- Identify the four primary compute types: Compute Instances (workstations), Compute Clusters (scalable training), Inference Clusters (deployment), and Attached Compute.
- Determine the appropriate compute resource based on the workload (e.g., development vs. production).
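As a study aid, the mapping from workload to compute type can be written as a small lookup. This is a sketch, not an Azure API: the function `choose_compute` and the workload labels are hypothetical, and the description of Attached Compute as existing external compute linked to the workspace is a gloss added here rather than taken from the list above.

```python
from enum import Enum


class ComputeType(Enum):
    COMPUTE_INSTANCE = "Compute Instance"
    COMPUTE_CLUSTER = "Compute Cluster"
    INFERENCE_CLUSTER = "Inference Cluster"
    ATTACHED_COMPUTE = "Attached Compute"


def choose_compute(workload: str) -> ComputeType:
    """Return the compute type that matches a (hypothetical) workload label."""
    mapping = {
        # Interactive development, e.g. a Jupyter notebook session.
        "notebook_development": ComputeType.COMPUTE_INSTANCE,
        # Training jobs that need to scale out across nodes.
        "scalable_training": ComputeType.COMPUTE_CLUSTER,
        # Hosting a deployed model for predictions.
        "production_deployment": ComputeType.INFERENCE_CLUSTER,
        # Assumption: existing compute outside the workspace, linked to it.
        "existing_external_compute": ComputeType.ATTACHED_COMPUTE,
    }
    try:
        return mapping[workload]
    except KeyError:
        raise ValueError(f"Unknown workload: {workload!r}") from None


print(choose_compute("notebook_development").value)
print(choose_compute("scalable_training").value)
```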
Module 4: No-Code Training Tools
- Describe the capabilities of Automated Machine Learning (AutoML) for rapid model selection.
- Explain how Azure Machine Learning Designer uses a visual drag-and-drop interface for pipeline creation.
Module 5: Model Management
- Describe how registering a model tracks its versions.
- Identify deployment options for real-time or batch inferencing.
Visual Overview of Resources
The ML Lifecycle Flow
Workspace Resource Architecture
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, rounded corners, minimum width=3cm, minimum height=1cm, align=center}]
% Define components
\node (workspace) [fill=blue!10] {Azure ML Workspace};
\node (compute) [below left of=workspace, xshift=-1cm, fill=green!10] {Compute Resources\\(Instances/Clusters)};
\node (data) [below right of=workspace, xshift=1cm, fill=orange!10] {Data Resources\\(Datastore/Assets)};
\node (tools) [below of=workspace, yshift=-3cm, fill=purple!10] {Authoring Tools\\(AutoML/Designer/Notebooks)};
% Draw connections
\draw [<->] (workspace) -- (compute);
\draw [<->] (workspace) -- (data);
\draw [->] (compute) -- (tools);
\draw [->] (data) -- (tools);
\draw [dashed] (workspace) -- (tools);
\node[draw=none, below of=tools, yshift=0.5cm] {\small \textit{Figure 1: Relationship between Workspace, Infrastructure, and Tools}};
\end{tikzpicture}
Success Metrics
To demonstrate mastery of this curriculum, the learner must be able to:
- Select Compute: Correctly choose a Compute Cluster for a job requiring multiple GPUs and a Compute Instance for a Jupyter Notebook development session.
- Navigate Studio: Successfully locate the "Data" and "Compute" tabs in Azure ML Studio to provision resources.
- Explain AutoML: Describe how AutoML automates the selection of algorithms and hyperparameters to save time.
- Differentiate Tools: Explain when to use the Designer (visual workflow) versus Notebooks (code-first with PyTorch/TensorFlow).
- Identify Deployment: Define the difference between a real-time endpoint for instant predictions and batch inferencing for large datasets.
Real-World Application
[!IMPORTANT] Why this matters: In a professional setting, data scientists are often said to spend as much as 80% of their time on data preparation and infrastructure management. Mastery of Azure's compute and data services allows teams to:
- Scale Efficiently: Use "zero-node" clusters (compute clusters with the minimum node count set to 0) that scale down when idle, so compute is billed only while a job is running, significantly reducing cloud costs.
- Maintain Reproducibility: By using versioned Data Assets, teams can ensure that the exact dataset used to train a model in 2023 can be referenced again in 2025 for auditing.
- Collaborate Securely: Use a centralized Workspace to share models and data without passing around CSV files or login credentials, which supports the privacy and security principle of Responsible AI.
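The saving from scaling a cluster to zero can be made concrete with a short calculation. This is a simplified sketch: the hourly rate and the number of busy hours are hypothetical, the workload is assumed to use one node at a time, and startup time, scale-down delay, and storage charges are ignored.

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def monthly_compute_cost(busy_hours: float, hourly_rate: float, min_nodes: int) -> float:
    """Estimate monthly cost for a workload that uses one node at a time.

    Nodes kept alive by min_nodes are billed for the whole month, whether
    or not a job is running; with min_nodes=0 only busy hours are billed.
    """
    billed_hours = max(busy_hours, min_nodes * HOURS_PER_MONTH)
    return billed_hours * hourly_rate


rate = 2.0        # hypothetical price per node-hour
busy_hours = 40   # hypothetical training time per month

scale_to_zero = monthly_compute_cost(busy_hours, rate, min_nodes=0)
always_on = monthly_compute_cost(busy_hours, rate, min_nodes=1)
print(f"min nodes = 0: {scale_to_zero:.2f}")
print(f"min nodes = 1: {always_on:.2f}")
```

Under these assumptions the always-on node is billed for all 730 hours even though it works for only 40 of them, which is why a minimum node count of zero is the usual choice for intermittent training jobs.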