Geek Logbook

Tech sea log book

The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for different reasons. Understanding their journey is essential to see how we arrived at today’s Data Lakehouse architectures.


Hive (2008): SQL on Hadoop

In the mid-2000s, Facebook’s engineers were drowning in data. Hadoop had become the storage and processing backbone, but working with MapReduce jobs directly was slow and complex. Data analysts, who were more familiar with SQL, couldn’t easily access Hadoop.

The solution was Hive, introduced in 2008. Hive added:

  • A SQL-like query language (HiveQL) that analysts could use.
  • A compiler that translated HiveQL into MapReduce jobs on Hadoop.
  • A Metastore to manage table definitions, partitions, and schemas.

Hive democratized access to Big Data inside Facebook. Instead of writing complex MapReduce programs, analysts could run queries like:

SELECT user_id, COUNT(*) 
FROM page_views 
WHERE date = '2008-11-01' 
GROUP BY user_id;
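
A query like this only works because the Metastore knows the table's schema and partition layout. As an illustration (the real Facebook schema is not public, so every column name here is hypothetical), a partitioned `page_views` table could have been declared in HiveQL roughly like this:

```sql
-- Hypothetical HiveQL DDL; column names and formats are illustrative only.
CREATE TABLE page_views (
  user_id      BIGINT,
  page_url     STRING,
  referrer_url STRING
)
PARTITIONED BY (`date` STRING)  -- partition values are tracked in the Metastore
STORED AS TEXTFILE;             -- files live in per-partition directories on HDFS
```

The `WHERE date = '2008-11-01'` predicate in the query above then lets Hive prune the scan to a single partition directory instead of reading the whole table off HDFS.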

The problem: Hive was batch-oriented. Every query compiled to one or more MapReduce jobs, with intermediate results written to disk between stages, so even simple queries often took minutes or hours — unsuitable for interactive analysis.


Presto (2012): Interactive SQL at Scale

By 2012, Facebook needed faster analytics. Analysts were frustrated by the latency of Hive, especially when exploring data. The engineering team built a new query engine from scratch: Presto.

Key innovations in Presto:

  • Distributed, in-memory execution (no MapReduce).
  • Interactive response times (seconds, not hours).
  • Federated queries: Presto could query not only Hadoop, but also MySQL, Cassandra, and other systems with the same SQL syntax.

This meant that instead of waiting for batch jobs, analysts could now write:

SELECT COUNT(*) 
FROM hive.page_views p
JOIN mysql.users u 
ON p.user_id = u.id;

And Presto would execute the query across Hadoop and MySQL simultaneously.
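
Federation works because each external system is registered as a catalog. In today's Trino (and similarly in Presto), the `mysql` catalog referenced in the query above is defined by a small properties file along these lines — the hostname and credentials here are placeholders:

```properties
# etc/catalog/mysql.properties — connection details are hypothetical
connector.name=mysql
connection-url=jdbc:mysql://mysql.example.com:3306
connection-user=analyst
connection-password=secret
```

The file name becomes the catalog name, so tables in that MySQL instance are addressable as `mysql.<schema>.<table>` in any query.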

Presto quickly became Facebook’s main interactive query engine, replacing Hive for most day-to-day analysis.


From Presto to Trino (2019)

In 2018, the original creators of Presto left Facebook, and in January 2019 they forked the project as PrestoSQL; after a trademark dispute with Facebook, PrestoSQL was renamed Trino in December 2020. Meanwhile, the version inside Facebook remained PrestoDB.

Today:

  • Trino is the faster-evolving, community-driven project, powering modern Lakehouse systems.
  • PrestoDB lives on under the Presto Foundation (hosted by the Linux Foundation), with Meta among its main backers, but has a smaller ecosystem.

Trino has extended beyond Hadoop:

  • Native connectors for S3, GCS, ADLS, MinIO.
  • Direct support for modern table formats: Iceberg, Delta Lake, Hudi.
  • Federation across traditional databases and warehouses.
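
To make the Lakehouse piece concrete: pointing Trino at Iceberg tables takes only a catalog file of roughly this shape (the metastore hostname is a placeholder):

```properties
# etc/catalog/iceberg.properties — hostname is hypothetical
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://metastore.example.com:9083
```

With that in place, Iceberg tables are queried as `iceberg.<schema>.<table>` with the same SQL used for everything else — the same federation model Presto introduced, extended to modern table formats.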

Hive vs. Trino in the Lakehouse Era

| Feature         | Hive (2008)                        | Trino (2012 → now)                            |
|-----------------|------------------------------------|-----------------------------------------------|
| Execution model | Batch (MapReduce, later Tez/Spark) | Interactive, in-memory, distributed           |
| Latency         | Minutes to hours                   | Seconds                                       |
| Storage target  | HDFS                               | HDFS, S3, MinIO, GCS, ADLS                    |
| Metadata        | Hive Metastore                     | Hive Metastore, Glue, Nessie, REST catalogs   |
| Use case        | Batch ETL, long-running queries    | Interactive queries, federated analytics, Lakehouse |

Conclusion

  • Hive made SQL possible on Hadoop and introduced the concept of a metadata catalog that is still vital today.
  • Trino revolutionized the space by enabling fast, federated, and interactive queries at massive scale.
  • Together, they represent two generations of Big Data engines: Hive for the Hadoop era, Trino for the Lakehouse era.

Modern data platforms no longer rely on HDFS or MapReduce, but the DNA of Hive and Trino lives on in every query executed against S3, MinIO, or Iceberg tables.
