Data Transformation Mastery: From CSV to Parquet
Transforming data between formats (for example, .csv to .parquet)
This guide covers the essential techniques and AWS services used to transform data from raw, row-based formats (like .csv) into optimized, columnar-based formats (like .parquet). This process is critical for building cost-effective and high-performance data lakes on AWS.
Learning Objectives
- Identify the key AWS services used for data transformation (AWS Glue, Amazon EMR, Kinesis Data Firehose).
- Explain the technical and financial benefits of converting row-based data to columnar formats.
- Understand the role of ETL (Extract, Transform, Load) in the data lifecycle.
- Recognize when to use specialized features like Glue's FindMatches ML for data cleaning.
Key Terms & Glossary
- ETL (Extract, Transform, Load): The process of pulling data from sources, changing its format or structure, and loading it into a destination (like a Data Lake).
- CSV (Comma Separated Values): A row-based, plain-text data format. Easy to read but inefficient for large-scale queries.
- Parquet: An open-source, columnar storage format that provides efficient data compression and encoding schemes.
- AWS Glue: A fully managed, serverless ETL service that makes it easy to categorize, clean, and transform data.
- Schema-on-Read: A data analysis strategy where the structure (schema) is applied to raw data only when it is queried, rather than when it is stored.
The "Big Idea"
Data transformation is the bridge between Data Ingestion (getting the data) and Data Analytics (getting value). By transforming data into formats like Parquet, you directly address SAA-C03 Content Domain 4: Design Cost-Optimized Architectures. Querying 1 TB of CSV data in Athena is significantly more expensive and slower than querying the same data compressed and converted to Parquet, because Athena reads only the specific columns required by your query.
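The cost claim is easy to quantify. A minimal back-of-envelope sketch, assuming Athena's commonly cited price of $5 per TB scanned (check current AWS pricing) and an illustrative 4x compression ratio with 1 of 10 columns read:

```python
# Back-of-envelope Athena cost comparison.
# Assumes the commonly cited $5 per TB scanned; check current pricing.
PRICE_PER_TB = 5.00

def athena_query_cost(tb_scanned: float) -> float:
    """Cost in USD for a single query scanning `tb_scanned` TB."""
    return tb_scanned * PRICE_PER_TB

csv_cost = athena_query_cost(1.0)  # full 1 TB of CSV scanned

# Parquet (illustrative): ~4x compression, and the query reads
# only 1 of, say, 10 columns -> 1/40 of the data scanned.
parquet_cost = athena_query_cost(1.0 / 4 / 10)

print(f"CSV:     ${csv_cost:.2f}")      # $5.00 per query
print(f"Parquet: ${parquet_cost:.3f}")  # $0.125 per query
```

The compression ratio and column count are assumptions for illustration; real savings depend on the data and the query.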
Formula / Concept Box
| Feature | Row-Based (CSV/JSON) | Columnar-Based (Parquet/ORC) |
|---|---|---|
| Storage Efficiency | Low (Text-heavy) | High (Binary compression) |
| Query Performance | Slower (Reads entire row) | Faster (Reads specific columns) |
| Cost (Athena/S3) | Higher (More data scanned) | Lower (Less data scanned) |
| Use Case | Data Ingestion / Logging | Data Warehousing / Analytics |
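The table above can be made concrete with a toy model: row-based storage keeps each record together, while columnar storage keeps each field together, so reading one field touches only one contiguous structure. A minimal pure-Python sketch (no AWS involved):

```python
# Toy model: the same three records laid out row-wise vs column-wise.
row_based = [                      # CSV-style: one record per entry
    {"id": 1, "name": "Ana",  "amount": 10.0},
    {"id": 2, "name": "Ben",  "amount": 20.0},
    {"id": 3, "name": "Cara", "amount": 30.0},
]

columnar = {                       # Parquet-style: one field per entry
    "id":     [1, 2, 3],
    "name":   ["Ana", "Ben", "Cara"],
    "amount": [10.0, 20.0, 30.0],
}

# Summing 'amount' row-wise walks every full record...
row_total = sum(rec["amount"] for rec in row_based)

# ...while the columnar layout hands back just the one list.
col_total = sum(columnar["amount"])

assert row_total == col_total == 60.0
```

On disk the effect is the same: a columnar engine can skip the bytes of every column a query never mentions.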
Hierarchical Outline
- I. Data Ingestion Phase
- Sources: RDS, On-premises JDBC, S3, Kinesis Streams.
- Goal: Get raw data into the "Landing Zone" S3 bucket.
- II. Data Transformation Phase (The "T" in ETL)
- AWS Glue: Serverless Spark jobs for heavy lifting.
- Cleaning: Removing duplicates via FindMatches ML.
- Normalization: Imposing consistent timestamps (e.g., converting all to UTC).
  - Format Conversion: Changing `.csv` or `.json` to `.parquet` or `.orc`.
- III. Storage and Optimization
- S3 Data Lake: Storing the transformed data in a "Processed" bucket.
  - Partitioning: Organizing data by keys (e.g., `/year=/month=/day=/`) to limit data scans.
- IV. Analytics Consumption
- Amazon Athena: Querying S3 data using SQL.
- Amazon QuickSight: Visualizing the results.
Visual Anchors
The ETL Pipeline Flow
Row vs. Columnar Storage Logic
\begin{tikzpicture}[scale=0.8]
% Row-based representation
\draw[fill=blue!10] (0,4) rectangle (6,5) node[midway] {Row 1: ID, Name, Date, Amount};
\draw[fill=blue!10] (0,3) rectangle (6,4) node[midway] {Row 2: ID, Name, Date, Amount};
\draw[fill=blue!10] (0,2) rectangle (6,3) node[midway] {Row 3: ID, Name, Date, Amount};
\node at (3,5.5) {\textbf{Row-Based (CSV)}};
% Column-based representation
\draw[fill=green!10] (8,2) rectangle (9,5) node[midway, rotate=90] {IDs};
\draw[fill=green!10] (9.5,2) rectangle (10.5,5) node[midway, rotate=90] {Names};
\draw[fill=green!10] (11,2) rectangle (12,5) node[midway, rotate=90] {Dates};
\draw[fill=green!10] (12.5,2) rectangle (13.5,5) node[midway, rotate=90] {Amounts};
\node at (10.75,5.5) {\textbf{Columnar (Parquet)}};
% Highlight why Columnar is better for specific queries
\draw[red, thick, dashed] (12.4,1.8) rectangle (13.6,5.2);
\node[red, font=\scriptsize] at (13, 1.5) {Query only this column};
\end{tikzpicture}
Definition-Example Pairs
- Deduplication: The process of identifying and removing duplicate records to save storage and query cost.
- Example: Using FindMatches ML in Glue to realize that "John Smith" in the Sales DB and "J. Smith" in the Marketing DB are the same person and merging them.
- Compression: Reducing the file size without losing information.
- Example: Converting a 100MB CSV log file to a 20MB Parquet file, which saves 80% on S3 storage costs.
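Parquet's savings come from its own dictionary and run-length encodings plus codecs like Snappy, but the underlying reason text-heavy logs shrink so much is their redundancy. As a rough stand-in (not Parquet itself), gzip-compressing repetitive CSV-style rows with only the standard library shows the effect:

```python
import gzip

# Repetitive, text-heavy rows, like typical web-server log CSV lines.
csv_bytes = b"".join(
    b"2024-01-15,us-east-1,GET,/index.html,200\n" for _ in range(10_000)
)

compressed = gzip.compress(csv_bytes)

ratio = len(compressed) / len(csv_bytes)
print(f"{len(csv_bytes)} -> {len(compressed)} bytes ({ratio:.1%})")

# Highly repetitive text compresses dramatically; Parquet's encodings
# exploit the same redundancy, column by column.
assert len(compressed) < len(csv_bytes) / 5
```

Real-world ratios vary with the data; the 80% figure in the example above is typical for log data, not guaranteed.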
- Partitioning: Dividing a dataset into discrete parts based on column values.
- Example: Organizing logs in S3 as `s3://bucket/logs/region=us-east-1/` so that a query about US-East data never even touches the files for Europe.
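Hive-style partition prefixes like the one above are just key/value pairs embedded in the S3 key. A small sketch of building such a prefix (the field names here are illustrative, not a Glue API):

```python
from datetime import date

def partition_key(prefix: str, region: str, day: date) -> str:
    """Build a Hive-style partitioned S3 key prefix (illustrative)."""
    return (f"{prefix}/region={region}"
            f"/year={day.year}/month={day.month:02d}/day={day.day:02d}/")

key = partition_key("s3://bucket/logs", "us-east-1", date(2024, 1, 15))
print(key)
# s3://bucket/logs/region=us-east-1/year=2024/month=01/day=15/
```

Because the partition values live in the path, Athena can prune whole prefixes from a scan before reading a single byte of data.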
Worked Example: Converting S3 Logs
Scenario: A company stores 500 GB of web server logs in S3 in .csv format. They use Amazon Athena to run daily reports. Their Athena bills are high because every query scans the full 500 GB.
Step-by-Step Solution:
- Crawler: Run an AWS Glue Crawler on the raw S3 bucket to automatically infer the schema and populate the Glue Data Catalog.
- Glue Job: Create a Spark ETL job. In the visual editor, select the Source (CSV bucket) and the Target (a new S3 bucket).
- Transformation: In the job settings, change the output format from `CSV` to `Parquet`.
- Partitioning: Set the job to partition the data by `customer_id` and `event_date`.
- Execution: Run the job. Glue will read the CSVs, convert them to binary Parquet, and write them to the target bucket.
- Result: When Athena queries the new bucket, it only scans the specific columns needed (e.g., just `status_code`), reducing data scanned by up to 90%.
Checkpoint Questions
- Which AWS service is specifically designed to perform serverless ETL and convert data formats?
- Why is Parquet more cost-effective than CSV for use with Amazon Athena?
- What is the benefit of using FindMatches ML over standard SQL `DISTINCT` queries?
- True or False: A Data Lake requires data to be cleaned and structured before it can be stored in S3.
- Which service would you use to visualize the data once it has been transformed and queried by Athena?
[!TIP] In the SAA-C03 exam, if you see a requirement to "reduce cost for Athena queries" or "optimize data for analytics," the answer almost always involves converting data to Parquet using AWS Glue.