Geek Logbook

Tech sea log book

Trino in Modern Architectures: SQL Queries on S3 and MinIO

The rise of cloud object storage has transformed how organizations build data platforms. The Hadoop Distributed File System (HDFS) once dominated, but today services like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and on-premises solutions like MinIO are the new foundation. In this shift, Trino has emerged as the query engine of choice, running standard SQL directly against data stored as objects.
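To make the idea concrete, here is a minimal sketch (illustrative, not taken from the post) of querying data on MinIO through Trino's Python client. It assumes a Trino coordinator on localhost:8080 with a catalog named hive already configured against a MinIO bucket; the schema, table, and column names are made up.

```python
# Minimal sketch: querying Parquet objects stored in MinIO through Trino.
# Assumes a Trino coordinator on localhost:8080 and a catalog named "hive"
# whose metastore points at a MinIO bucket (all names are illustrative).
from trino.dbapi import connect

conn = connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",   # catalog backed by the Hive Metastore + MinIO
    schema="sales",   # hypothetical schema
)
cur = conn.cursor()

# Plain ANSI SQL, even though the data lives as objects rather than in a database.
cur.execute("""
    SELECT region, sum(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
    LIMIT 10
""")
for region, total in cur.fetchall():
    print(region, total)
```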

Hive Metastore: The Glue Holding Big Data Together

When people think of Hive, they often remember the early days of Hadoop and MapReduce. But while Hive as a query engine has largely faded, one of its components remains critical to the modern data ecosystem: the Hive Metastore. This metadata service has become the backbone of Big Data platforms, powering not just Hive itself but also engines such as Spark, Presto, and Trino.

Why Parquet Became the Standard for Analytics

In the early days of Big Data, data was often stored in simple formats such as CSV, JSON, or text logs. While these formats were easy to generate and understand, they quickly became inefficient at scale. The analytics community needed a storage format that could reduce costs, improve query performance, and work across a diverse ecosystem of tools. Apache Parquet became that format.
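As a rough illustration of the difference, the sketch below converts a CSV file to Parquet with PyArrow and then reads back only the columns a query needs; the file and column names are just placeholders.

```python
# Minimal sketch: convert a row-oriented CSV file to columnar Parquet,
# then read back only the columns a query needs (names are illustrative).
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the CSV into an Arrow table.
table = pv.read_csv("events.csv")

# Write it as compressed, columnar Parquet.
pq.write_table(table, "events.parquet", compression="snappy")

# Analytics engines can now scan just the needed columns
# instead of parsing every row in full.
subset = pq.read_table("events.parquet", columns=["user_id", "event_type"])
print(subset.num_rows, subset.column_names)
```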

HDFS vs. Object Storage: The Battle for Distributed Storage

Distributed storage has always been the foundation of Big Data. In the early days, the Hadoop Distributed File System (HDFS) was the de facto standard. Today, however, object storage systems like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and MinIO are taking over. This shift reflects a broader change in how organizations design and operate their data platforms.

What Is a Data Lake and What Is a Data Lakehouse?

Over the last decade, the world of data architecture has gone through several transformations. From traditional data warehouses to Hadoop-based data lakes and now to the emerging Lakehouse paradigm, each stage represents a response to new challenges in scale, cost, and flexibility. But what exactly is a Data Lake, and how does a Data Lakehouse differ from it?

The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for different reasons. Understanding their journey is essential to see how we arrived at today’s Data Lakehouse architectures. Hive (2008): SQL on Hadoop.

Google Bigtable vs. Amazon DynamoDB: Understanding the Differences

When choosing a NoSQL database for scalable, low-latency applications, two major options stand out: Google Cloud Bigtable and Amazon DynamoDB. While both are managed, highly available, and horizontally scalable, they are designed with different models and use cases in mind. The comparison covers their data models, query capabilities, and scalability.

How to Keep a Docker Container Running Persistently

When working with Docker, you may have noticed that some containers stop as soon as you exit the shell. This is because Docker considers the container’s main process to have finished. In this post, we will explain why this happens and how to keep your container running persistently, so you can reconnect whenever you need.
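A minimal sketch of the idea, using the Docker SDK for Python rather than the CLI: give the container a main process that never exits, and Docker will keep it running so you can exec into it later. The image and container name are illustrative.

```python
# Minimal sketch: keep a container alive by giving it a long-running main process.
# Uses the Docker SDK for Python (pip install docker); image and name are illustrative.
import docker

client = docker.from_env()

# "sleep infinity" (or "tail -f /dev/null") never exits, so Docker keeps the
# container running instead of stopping it when your shell session ends.
container = client.containers.run(
    "ubuntu:22.04",
    command=["sleep", "infinity"],
    name="keepalive-demo",
    detach=True,
)
print(container.name)  # reconnect later with: docker exec -it keepalive-demo bash
```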

Orchestrating Multiple AWS Glue Workflows: A Practical Guide

AWS Glue provides a robust environment for building and managing ETL pipelines, but many data engineers face the challenge of chaining or coordinating multiple workflows. This article explores practical approaches to relating two or more Glue workflows, covering both native features and complementary AWS services. Why you might need multiple workflows: in many data engineering projects, a single pipeline grows into several workflows that must run in a coordinated order.
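One possible pattern, sketched with boto3 and illustrative workflow names: start the upstream workflow, poll its run status, and only launch the downstream workflow once the first has completed. A native Glue trigger, an EventBridge rule, or a Step Functions state machine could serve the same purpose.

```python
# Minimal sketch: chain two Glue workflows from a small orchestration script.
# Workflow names are illustrative; native triggers or EventBridge/Step Functions
# can achieve the same coordination without custom code.
import time
import boto3

glue = boto3.client("glue")

# Kick off the upstream workflow and remember its run id.
run_id = glue.start_workflow_run(Name="ingest-raw-data")["RunId"]

# Poll until the upstream run reaches a terminal state.
while True:
    status = glue.get_workflow_run(Name="ingest-raw-data", RunId=run_id)["Run"]["Status"]
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(60)

# Only start the downstream workflow if the upstream one succeeded.
if status == "COMPLETED":
    glue.start_workflow_run(Name="build-analytics-tables")
```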