Trino in Modern Architectures: SQL Queries on S3 and MinIO
The rise of cloud object storage has transformed how organizations build data platforms. Hadoop Distributed File System (HDFS) once dominated, but today services like Amazon S3, Google Cloud Storage (GCS), and Azure Data Lake Storage (ADLS), along with on-premises solutions like MinIO, are the new foundation.
In this shift, Trino has emerged as a popular query engine for running SQL directly on object storage. It provides interactive, federated queries across structured and semi-structured data, without first moving that data into a traditional data warehouse.
What Is Trino?
Trino is a distributed SQL query engine originally developed at Facebook (as Presto). It is designed for:
- Interactive queries at scale (seconds, not hours).
- Federated analytics across multiple sources (S3, MySQL, Kafka, Elasticsearch, etc.).
- Lakehouse integration with modern table formats (Iceberg, Delta Lake, Hudi).
Unlike data warehouses, Trino doesn’t store data. Instead, it reads directly from the source.
Why Trino + Object Storage Works
Modern object storage systems (S3, MinIO, GCS, ADLS) are:
- Cheap (pay-per-use or commodity hardware).
- Scalable (virtually unlimited capacity).
- Flexible (can store structured, semi-structured, and unstructured data).
But raw storage is not enough—users need SQL analytics. That’s where Trino fits in:
- Reads directly from Parquet/ORC in S3 or MinIO.
- Uses a metastore or catalog service (Hive Metastore, AWS Glue, Nessie, Iceberg REST) to map table names to files and schemas.
- Executes SQL queries with column pruning and predicate pushdown for efficiency.
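To make column pruning and predicate pushdown concrete, here is a minimal Python sketch. It is a toy model, not Trino's implementation: the `RowGroup` class, the `scan` function, and the sample data are all hypothetical. It only illustrates the idea that a columnar reader can skip whole chunks using min/max statistics and touch only the columns a query needs.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group: columnar values plus min/max statistics."""
    columns: dict                    # column name -> list of values
    stats: dict = field(init=False)  # column name -> (min, max)

    def __post_init__(self):
        # Real files keep such statistics in their metadata; here we derive them once.
        self.stats = {name: (min(vals), max(vals)) for name, vals in self.columns.items()}


def scan(row_groups, wanted_columns, filter_column, lower_bound):
    """Return rows where filter_column >= lower_bound, reading as little as possible."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, max_value = group.stats[filter_column]
        if max_value < lower_bound:
            continue  # predicate pushdown: the statistics prove no row can match
        groups_read += 1
        # Column pruning: touch only the filter column and the requested columns.
        filter_values = group.columns[filter_column]
        selected = [group.columns[name] for name in wanted_columns]
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                rows.append(tuple(col[i] for col in selected))
    return rows, groups_read


# Hypothetical sample data: two row groups of a sales table.
ROW_GROUPS = [
    RowGroup({"country": ["DE", "FR"], "amount": [10, 20], "year": [2023, 2023]}),
    RowGroup({"country": ["DE", "US"], "amount": [30, 40], "year": [2025, 2025]}),
]

rows, groups_read = scan(ROW_GROUPS, ["country", "amount"], "year", 2025)
print(rows, groups_read)
```

With this sample data only the second row group is read, and only `country`, `amount`, and `year` are ever accessed.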
Example: Querying Data on MinIO with Trino
Imagine you have sales data stored in MinIO in Parquet format. Because MinIO is addressed through the S3 API, the object location is written as an s3:// URI:
s3://analytics/sales/year=2025/month=09/day=22/part-0001.parquet
You register the dataset in a catalog (e.g., Hive Metastore or Iceberg REST). Then you can query it in Trino:
SELECT country, SUM(amount) AS revenue
FROM sales
WHERE year = 2025 AND month = 09
GROUP BY country;
The query engine automatically:
- Reads only the country and amount columns (column pruning).
- Scans only the relevant partitions (year=2025, month=09).
- Pushes filters down into the Parquet/ORC reader, which uses file statistics to skip data that cannot match.
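The partition step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Trino works internally: in practice the engine gets partition information from the catalog rather than by parsing paths at query time. The helper names `parse_partitions` and `prune` are hypothetical, and the first two object keys are made up to sit alongside the example file.

```python
def parse_partitions(path):
    """Extract Hive-style key=value path segments, e.g. 'year=2025' -> {'year': '2025'}."""
    parts = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            parts[key] = value
    return parts


def prune(paths, **wanted):
    """Keep only files whose partition values equal every wanted key (compared as integers)."""
    kept = []
    for path in paths:
        parts = parse_partitions(path)
        if all(key in parts and int(parts[key]) == value for key, value in wanted.items()):
            kept.append(path)
    return kept


# Object keys inside the bucket; the first two are hypothetical neighbours of the example file.
FILES = [
    "sales/year=2024/month=09/day=22/part-0001.parquet",
    "sales/year=2025/month=08/day=30/part-0001.parquet",
    "sales/year=2025/month=09/day=22/part-0001.parquet",
]

# Mirrors WHERE year = 2025 AND month = 09 from the query above.
print(prune(FILES, year=2025, month=9))
```

Only the last file survives pruning, so the other two are never opened.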
Trino in a Multi-Source Architecture
One of Trino’s greatest strengths is federation. You can run queries across multiple systems at once:
SELECT u.id, u.name, SUM(s.amount) AS total_spent
FROM mysql.users u
JOIN sales s
ON u.id = s.user_id
WHERE s.year = 2025
GROUP BY u.id, u.name;
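Here, Trino queries MySQL for user data and MinIO (Parquet/Iceberg) for sales data in a single standard SQL statement. The table names are shortened for readability: Trino addresses tables as catalog.schema.table, so in practice both names would be fully qualified unless a default catalog and schema are set for the session.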
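To see the shape of such a cross-source join without a cluster, the following Python sketch runs the same query against SQLite. This is only a stand-in: two separate in-memory SQLite databases play the roles of the MySQL catalog and the object-storage catalog, and the sample rows are invented. It says nothing about how Trino plans or executes the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second, independent in-memory database attached under the name "mysql"
# stands in for the MySQL catalog; the main database stands in for the lake.
conn.execute("ATTACH DATABASE ':memory:' AS mysql")
conn.execute("CREATE TABLE mysql.users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales (user_id INTEGER, amount REAL, year INTEGER)")

# Invented sample rows.
conn.executemany("INSERT INTO mysql.users VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 50.0, 2025), (1, 25.0, 2025), (2, 10.0, 2024), (2, 5.0, 2025)],
)

# The article's query, with ORDER BY added so the output order is deterministic.
QUERY = """
SELECT u.id, u.name, SUM(s.amount) AS total_spent
FROM mysql.users u
JOIN sales s
  ON u.id = s.user_id
WHERE s.year = 2025
GROUP BY u.id, u.name
ORDER BY u.id
"""
totals = conn.execute(QUERY).fetchall()
print(totals)
```

The 2024 sale is filtered out before aggregation, so the result is one row per user with their 2025 total.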
S3 vs. MinIO: Cloud and On-Prem
- S3: The standard in the cloud, fully managed, elastic.
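- MinIO: A lightweight, self-hosted, S3-compatible alternative for on-premises or hybrid setups.
- Both look the same to Trino because Trino speaks the S3 API. For MinIO you typically only need to point the catalog at a custom endpoint and enable path-style access.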
This flexibility allows organizations to build Lakehouses on any infrastructure.
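As a rough illustration of how small the difference is, the Python sketch below renders a Hive catalog properties file for either target. The property names follow Trino's native S3 file system support as I understand it and may differ between Trino versions, so check the documentation for your release. The function name, host names, and region are hypothetical, and credentials are deliberately left out.

```python
def catalog_properties(metastore_uri, region, endpoint=None):
    """Render a Hive catalog file for S3, or for an S3-compatible store when endpoint is set."""
    props = {
        "connector.name": "hive",
        "hive.metastore.uri": metastore_uri,
        "fs.native-s3.enabled": "true",
        "s3.region": region,
    }
    if endpoint is not None:
        # S3-compatible stores such as MinIO are reached via an explicit endpoint
        # and usually need path-style bucket addressing.
        props["s3.endpoint"] = endpoint
        props["s3.path-style-access"] = "true"
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Hypothetical hosts; credentials would be supplied separately.
aws = catalog_properties("thrift://metastore:9083", "us-east-1")
minio = catalog_properties("thrift://metastore:9083", "us-east-1", endpoint="http://minio:9000")
print(minio)
```

The two rendered files differ only in the endpoint and path-style lines; table definitions and queries stay the same.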
Benefits of Trino + Object Storage
- Interactive analytics: Run SQL on raw data interactively, often in seconds when the data is columnar and well partitioned.
- No ETL into a warehouse: Query data where it lives.
- Cost efficiency: Storage and compute scale independently, so you pay for each only as you use it.
- Future-proof: Works with modern formats (Parquet, ORC, Iceberg, Delta, Hudi).
- Federation: Query multiple systems in one place.
Conclusion
Trino has become the engine that unlocks the value of object storage. By combining SQL interactivity with the scalability of S3 or MinIO, it eliminates the old boundaries between lakes and warehouses.
In today’s Lakehouse architectures, Trino is no longer just “Presto for Hadoop”—it is the SQL layer for the cloud-native data stack.