Mastering Data Formats for Machine Learning Workflows
Choosing appropriate data formats (for example, Parquet, JSON, CSV, ORC) based on data access patterns
Choosing the correct data format is a critical skill for a Machine Learning Engineer. The choice impacts storage costs, query performance, and the speed of model training. This guide focuses on aligning data formats (CSV, JSON, Parquet, ORC, Avro) with specific data access patterns and SageMaker requirements.
Learning Objectives
- Distinguish between row-based (CSV, Avro) and columnar (Parquet, ORC) storage formats.
- Select the appropriate data format based on data volume, schema flexibility, and query patterns.
- Identify which Amazon SageMaker built-in algorithms support specific data formats.
- Analyze how data access patterns (size and shape) influence formatting decisions.
Key Terms & Glossary
- Columnar Storage: A data storage architecture that stores data tables by column rather than by row. Ideal for analytical queries where only a subset of columns is needed.
- Semi-structured Data: Data that does not reside in a fixed-record length format but contains tags or markers to separate semantic elements (e.g., JSON).
- Serialization: The process of converting a data structure or object into a format that can be stored or transmitted and reconstructed later.
- RecordIO: A binary format used primarily by the Apache MXNet framework, optimized for high-throughput streaming in SageMaker.
- Schema-on-Read: Data is formatted and structured only when it is read (e.g., CSV, JSON).
- Schema-on-Write: Data must conform to a predefined schema before being written (e.g., Avro, Parquet).
The "Big Idea"
In ML Engineering, the "Data Handshake" is the alignment between how data is stored on disk and how it is consumed by a model. If your model only needs 3 features out of 200, but your data is in a row-based format like CSV, you are wasting I/O and time reading 197 unnecessary values. Choosing the right format ensures that hardware resources are used for computation, not just moving data.
Formula / Concept Box
| Access Pattern | Recommended Format | Key Reason |
|---|---|---|
| Full Row Retrieval | CSV / Avro | Low overhead for reading entire records. |
| Analytical / Columnar Scan | Parquet / ORC | Efficient compression, column pruning (reading only the needed columns), and predicate pushdown (skipping row groups using min/max statistics). |
| Hierarchical / Nested Data | JSON | Native support for complex, varying structures. |
| High-Throughput Streaming | RecordIO | Optimized for pipe mode in SageMaker to stream data from S3. |
Hierarchical Outline
- Row-Based Formats
- CSV (Comma-Separated Values): Simple, human-readable, best for small datasets and basic tabular structures.
- Avro: Binary, row-based format that stores the schema with the data; excellent for write-heavy logging and evolution.
- Columnar Formats
- Apache Parquet: Highly optimized for complex nested data and big data analytics (Hadoop/Spark ecosystem).
- Apache ORC: Optimized Row Columnar; similar to Parquet but often used in Hive environments; provides high compression.
- Semi-Structured Formats
- JSON (JavaScript Object Notation): Flexible, used for web logs and document-based data; requires more parsing overhead than binary formats.
- Algorithm Compatibility
- SageMaker Specifics: Understanding which algorithms accept CSV vs. Parquet vs. RecordIO-Protobuf.
Visual Anchors
Data Format Selection Logic
Row vs. Columnar Storage Layout
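The row-versus-columnar layout can be illustrated in plain Python with no libraries: row storage keeps each record together, while columnar storage keeps all values of one field together.

```python
# Sketch: the same three records laid out row-wise vs. column-wise.
records = [
    {"price": 101.5, "volume": 1000},
    {"price": 102.0, "volume": 2500},
    {"price": 99.8,  "volume": 1800},
]

# Row layout (CSV/Avro-style): one complete record after another.
row_layout = [(r["price"], r["volume"]) for r in records]

# Columnar layout (Parquet/ORC-style): all values of one field stored together.
columnar_layout = {
    "price":  [r["price"] for r in records],
    "volume": [r["volume"] for r in records],
}

# Reading a single column touches one contiguous list, not every record.
prices = columnar_layout["price"]
```

This is why analytical scans favor columnar formats: a query over one field reads one contiguous region instead of hopping through every record.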
Definition-Example Pairs
- Parquet: A columnar storage format available to any project in the Hadoop ecosystem.
- Example: A financial company stores 50TB of stock transactions but needs only the `closing_price` column for a prediction model. Using Parquet allows them to skip reading the `timestamp`, `volume`, and `trader_id` columns.
- Avro: A row-based remote procedure call and data serialization framework.
- Example: A microservice architecture where event schemas change over time. Avro allows the schema to be stored within the file so older consumers can still read new data correctly.
- JSON Lines (JSONL): A format where each line is a valid JSON object.
- Example: Training a SageMaker DeepAR forecasting model, which specifically accepts JSON Lines to handle time-series data chunks.
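JSON Lines can be produced with the standard library alone. The sketch below follows the DeepAR-style `start`/`target` record layout; the values are illustrative.

```python
# Sketch: writing and reading JSON Lines (one JSON object per line).
import json

series = [
    {"start": "2024-01-01 00:00:00", "target": [1.0, 2.0, 3.0]},
    {"start": "2024-01-02 00:00:00", "target": [4.0, 5.0]},
]

# Write one JSON object per line.
with open("train.jsonl", "w") as f:
    for record in series:
        f.write(json.dumps(record) + "\n")

# Each line parses independently, so files can be streamed line by line.
with open("train.jsonl") as f:
    parsed = [json.loads(line) for line in f]
```

Because every line is a self-contained object, JSONL files can be split and streamed without loading the whole file, unlike a single top-level JSON array.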
Worked Examples
Scenario: Optimizing an XGBoost Training Job
Problem: You have a 500GB dataset stored in S3 as a single CSV file. Your SageMaker XGBoost training job is taking too long to start and is very expensive.
Step-by-Step Breakdown:
- Analyze the format: the dataset is one large CSV file. In the default File mode, SageMaker must download the entire 500GB to the training instance before training starts, and a single file cannot be downloaded or read in parallel.
- Evaluate Access Pattern: XGBoost can handle CSV or Parquet.
- Transformation: Convert the CSV to Parquet.
- Partitioning: Split the Parquet data into multiple smaller files (e.g., 100MB each) in S3.
- Result: SageMaker can stream the data with "Pipe Mode" or download the files in parallel. Columnar compression shrinks the 500GB dataset significantly, lowering S3 storage costs and speeding up I/O.
Checkpoint Questions
- Which data format is best for semi-structured, document-based data? (Answer: JSON)
- Why does Parquet provide better storage efficiency than CSV? (Answer: Column-specific compression and encoding techniques.)
- If you are using the SageMaker Image Classification algorithm, which format is most appropriate for high-speed ingestion? (Answer: RecordIO)
- True or False: Row-based formats like Avro are preferred for heavy analytical queries that only touch 5% of columns. (Answer: False; Columnar is preferred.)
Muddy Points & Cross-Refs
- Parquet vs. ORC: Both are columnar. Parquet is generally better supported across the entire AWS ecosystem (Athena, Redshift, SageMaker), while ORC is often slightly more efficient in Hive-specific environments.
- RecordIO Protobuf: This is a specific binary format used for SageMaker built-in algorithms. It is not easily human-readable, unlike CSV or JSON. For deeper study on ingestion, cross-reference with Amazon S3 Pipe Mode.
Comparison Tables
| Feature | CSV | JSON | Parquet / ORC | Avro |
|---|---|---|---|---|
| Storage Type | Row | Document | Columnar | Row |
| Human Readable | Yes | Yes | No | No |
| Schema Evolution | Poor | Flexible | Good | Excellent |
| Compression | Low | Low | Very High | High |
| Best Use Case | Small datasets | Metadata/Logs | Big Data/ML Training | Messaging/Logging |