Why Parquet Became the Standard for Analytics

In the early days of Big Data, data was often stored in simple formats such as CSV, JSON, or text logs. While these formats were easy to generate and understand, they quickly became inefficient at scale. The analytics community needed a storage format that could reduce costs, improve query performance, and work across a diverse ecosystem of engines.

The answer came in the form of Apache Parquet, a columnar storage format that has since become the de facto standard for analytics and Lakehouse architectures.


The Problem with Row-Oriented Formats

Traditional formats like CSV or JSON store data row by row. This is ideal for transaction processing (OLTP) but inefficient for analytics (OLAP), where queries often scan only a few columns across billions of rows.

Example: CSV Storage

id,name,age,country
1,Alice,30,US
2,Bob,27,UK
3,Carol,45,DE

If an analyst wants to compute the average age, the engine still needs to read every field (id, name, country), not just the age column. This wastes I/O, memory, and CPU.
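
To make that cost concrete, here is a minimal Python sketch (the file names users.csv and users.parquet are hypothetical) contrasting the two reads with pandas. Even when you ask the CSV reader for a single column, it still has to scan and tokenize every line of the file; a Parquet reader, by contrast, can fetch just the bytes of the age column.

import pandas as pd

# CSV: usecols drops the other fields after parsing, but the reader
# still has to scan and tokenize every line of the file.
ages_csv = pd.read_csv("users.csv", usecols=["age"])["age"]
print(ages_csv.mean())

# Parquet: only the bytes of the age column are read from disk.
ages_parquet = pd.read_parquet("users.parquet", columns=["age"])["age"]
print(ages_parquet.mean())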


The Columnar Advantage

Parquet is a columnar format, meaning it stores data by column rather than by row.

Benefits

  1. Read efficiency: Queries scan only the necessary columns.
    • Example: SELECT AVG(age) reads only the age column (see the sketch after this list).
  2. Compression: Columnar data is highly compressible (values are often similar).
    • Saves storage and improves read speed.
  3. Encoding: Supports advanced encodings (dictionary, run-length, bit packing).
  4. Schema evolution: Allows adding/removing columns without rewriting entire datasets.
  5. Compatibility: Supported by virtually every modern engine, including Spark, Trino, Presto, Hive, Flink, Athena, and BigQuery.
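
As a rough sketch of points 1 to 3 (using pyarrow and the same hypothetical users.parquet file): the writer compresses and, by default, dictionary-encodes the columns, and the reader pulls back only the column a query needs.

import pyarrow as pa
import pyarrow.parquet as pq

# A tiny table standing in for real pipeline output.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 27, 45],
    "country": ["US", "UK", "DE"],
})

# Write with Snappy compression; pyarrow dictionary-encodes columns by
# default, which suits repetitive values like country.
pq.write_table(table, "users.parquet", compression="snappy")

# Column pruning: read back only the column the query needs.
ages = pq.read_table("users.parquet", columns=["age"])
print(ages["age"].to_pandas().mean())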

How Parquet Works

Parquet organizes data into:

  • Row groups → large chunks containing multiple rows.
  • Column chunks → within each row group, data is stored by column.
  • Pages → the smallest storage unit, often compressed.

This structure, which the sketch below inspects directly, enables:

  • Predicate pushdown (filtering data before reading).
  • Column pruning (reading only needed columns).
  • Parallel reads across multiple row groups.
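
One way to see this layout is pyarrow's metadata API, which exposes the row groups, column chunks, and the per-chunk min/max statistics that engines use for predicate pushdown. A sketch, again against the hypothetical users.parquet written above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("users.parquet")

print(pf.metadata.num_row_groups)      # how many row groups the file holds
row_group = pf.metadata.row_group(0)   # metadata for the first row group
chunk = row_group.column(0)            # the first column chunk in that row group

print(chunk.path_in_schema)            # which column this chunk stores
print(chunk.compression)               # codec used for its pages
print(chunk.statistics.min, chunk.statistics.max)  # min/max that enable predicate pushdown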

Example: CSV vs. Parquet Query

Imagine a dataset with 1 billion rows and 50 columns.

  • CSV: A query filtering by one column (e.g., WHERE country = 'US') still requires scanning all 50 columns.
  • Parquet: The engine reads only the country column (plus whatever columns the query selects), uses row-group statistics to skip data that cannot match, and then fetches just the rows it needs.

Result: often orders-of-magnitude less I/O, which translates into faster queries and much lower costs.
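
A hedged sketch of that difference in code, using hypothetical events.csv and events.parquet files and a hypothetical amount column: the CSV path parses everything and filters in memory, while the Parquet path prunes columns and pushes the filter down to the row-group statistics.

import pandas as pd
import pyarrow.parquet as pq

# CSV: every field of every row is parsed before the filter is applied.
events_csv = pd.read_csv("events.csv")
us_events_csv = events_csv[events_csv["country"] == "US"]

# Parquet: only two columns are read, and row groups whose country
# statistics rule out 'US' are skipped entirely.
us_events_parquet = pq.read_table(
    "events.parquet",
    columns=["country", "amount"],
    filters=[("country", "=", "US")],
).to_pandas()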


The Lakehouse Connection

Parquet plays a central role in Lakehouse architectures:

  • Data Lakes store raw data in Parquet to ensure efficient access.
  • Table formats like Iceberg, Delta Lake, and Hudi rely on Parquet as their underlying storage format.
  • Query engines like Trino, Spark, Athena, and BigQuery use Parquet natively, enabling SQL analytics directly on object storage (as sketched below).

Without Parquet (or ORC, its close relative), the Lakehouse model would not be practical.
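
For example, a minimal PySpark sketch of SQL analytics running directly over Parquet files in object storage (the bucket path is hypothetical, and the S3 connector configuration and credentials are omitted):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Read Parquet files straight from object storage and expose them to SQL.
events = spark.read.parquet("s3a://my-bucket/events/")
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT country, COUNT(*) AS event_count
    FROM events
    GROUP BY country
    ORDER BY event_count DESC
""").show()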


Conclusion

  • CSV and JSON are convenient but inefficient at scale.
  • Parquet’s columnar design, compression, and schema evolution make it the perfect fit for analytics.
  • It is now the default format for modern Data Lakes and Lakehouses, ensuring performance, cost-efficiency, and broad compatibility.

In short: if your analytics data is not in Parquet, you are leaving performance and money on the table.
