Geek Logbook

Tech sea log book

Trino in Modern Architectures: SQL Queries on S3 and MinIO

The rise of cloud object storage has transformed how organizations build data platforms. The Hadoop Distributed File System (HDFS) once dominated, but today services like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and on-premises solutions like MinIO are the new foundation. In this shift, Trino has emerged as the query engine of choice, running standard SQL directly against data stored as objects.
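To make the idea concrete, here is a minimal sketch (illustrative, not taken from the post) of querying data on MinIO through Trino's Python client. It assumes a Trino coordinator on localhost:8080 with a catalog named hive already configured against a MinIO bucket; the schema, table, and column names are made up.

```python
# Minimal sketch: querying Parquet objects stored in MinIO through Trino.
# Assumes a Trino coordinator on localhost:8080 and a catalog named "hive"
# whose metastore points at a MinIO bucket (all names are illustrative).
from trino.dbapi import connect

conn = connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",   # catalog backed by the Hive Metastore + MinIO
    schema="sales",   # hypothetical schema
)
cur = conn.cursor()

# Plain ANSI SQL, even though the data lives as objects rather than in a database.
cur.execute("""
    SELECT region, sum(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
    LIMIT 10
""")
for region, total in cur.fetchall():
    print(region, total)
```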

Hive Metastore: The Glue Holding Big Data Together

When people think of Hive, they often remember the early days of Hadoop and MapReduce. But while Hive as a query engine has largely faded, one of its components remains critical to the modern data ecosystem: the Hive Metastore. This metadata service has become the backbone of Big Data platforms, powering not just Hive itself but also engines such as Spark, Presto, and Trino.

Why Parquet Became the Standard for Analytics

In the early days of Big Data, data was often stored in simple formats such as CSV, JSON, or text logs. While these formats were easy to generate and understand, they quickly became inefficient at scale. The analytics community needed a storage format that could reduce costs, improve query performance, and work across a diverse ecosystem of tools. Apache Parquet became that format.
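As a rough illustration of the difference, the sketch below converts a CSV file to Parquet with PyArrow and then reads back only the columns a query needs; the file and column names are just placeholders.

```python
# Minimal sketch: convert a row-oriented CSV file to columnar Parquet,
# then read back only the columns a query needs (names are illustrative).
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the CSV into an Arrow table.
table = pv.read_csv("events.csv")

# Write it as compressed, columnar Parquet.
pq.write_table(table, "events.parquet", compression="snappy")

# Analytics engines can now scan just the needed columns
# instead of parsing every row in full.
subset = pq.read_table("events.parquet", columns=["user_id", "event_type"])
print(subset.num_rows, subset.column_names)
```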

HDFS vs. Object Storage: The Battle for Distributed Storage

Distributed storage has always been the foundation of Big Data. In the early days, the Hadoop Distributed File System (HDFS) was the de facto standard. Today, however, object storage systems like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and MinIO are taking over. This shift reflects a broader change in how organizations design and operate their data platforms.

What Is a Data Lake and What Is a Data Lakehouse?

Over the last decade, the world of data architecture has gone through several transformations. From traditional data warehouses to Hadoop-based data lakes and now to the emerging Lakehouse paradigm, each stage represents a response to new challenges in scale, cost, and flexibility. But what exactly is a Data Lake, and how does a Data Lakehouse differ from it?

The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for different reasons. Understanding their journey is essential to see how we arrived at today’s Data Lakehouse architectures. Hive (2008): SQL on Hadoop.

Google Bigtable vs. Amazon DynamoDB: Understanding the Differences

When choosing a NoSQL database for scalable, low-latency applications, two major options stand out: Google Cloud Bigtable and Amazon DynamoDB. While both are managed, highly available, and horizontally scalable, they are designed with different models and use cases in mind. The comparison covers their data models, query capabilities, and scalability.

How to Keep a Docker Container Running Persistently

When working with Docker, you may have noticed that some containers stop as soon as you exit the shell. This is because Docker considers the container’s main process to have finished. In this post, we will explain why this happens and how to keep your container running persistently, so you can reconnect whenever you need.
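A minimal sketch of the idea, using the Docker SDK for Python rather than the CLI: give the container a main process that never exits, and Docker will keep it running so you can exec into it later. The image and container name are illustrative.

```python
# Minimal sketch: keep a container alive by giving it a long-running main process.
# Uses the Docker SDK for Python (pip install docker); image and name are illustrative.
import docker

client = docker.from_env()

# "sleep infinity" (or "tail -f /dev/null") never exits, so Docker keeps the
# container running instead of stopping it when your shell session ends.
container = client.containers.run(
    "ubuntu:22.04",
    command=["sleep", "infinity"],
    name="keepalive-demo",
    detach=True,
)
print(container.name)  # reconnect later with: docker exec -it keepalive-demo bash
```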

Orchestrating Multiple AWS Glue Workflows: A Practical Guide

AWS Glue provides a robust environment for building and managing ETL pipelines, but many data engineers face the challenge of chaining or coordinating multiple workflows. This article explores practical approaches to relating two or more Glue workflows, covering both native features and complementary AWS services. Why you might need multiple workflows: in many data engineering projects, a single pipeline grows into several workflows that must run in a coordinated order.
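One possible pattern, sketched with boto3 and illustrative workflow names: start the upstream workflow, poll its run status, and only launch the downstream workflow once the first has completed. A native Glue trigger, an EventBridge rule, or a Step Functions state machine could serve the same purpose.

```python
# Minimal sketch: chain two Glue workflows from a small orchestration script.
# Workflow names are illustrative; native triggers or EventBridge/Step Functions
# can achieve the same coordination without custom code.
import time
import boto3

glue = boto3.client("glue")

# Kick off the upstream workflow and remember its run id.
run_id = glue.start_workflow_run(Name="ingest-raw-data")["RunId"]

# Poll until the upstream run reaches a terminal state.
while True:
    status = glue.get_workflow_run(Name="ingest-raw-data", RunId=run_id)["Run"]["Status"]
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(60)

# Only start the downstream workflow if the upstream one succeeded.
if status == "COMPLETED":
    glue.start_workflow_run(Name="build-analytics-tables")
```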