Geek Logbook

Tech sea log book

The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for different reasons. Understanding their journey is essential to see how we arrived at today’s Data Lakehouse architectures. Hive (2008): SQL on

Google Bigtable vs. Amazon DynamoDB: Understanding the Differences

When choosing a NoSQL database for scalable, low-latency applications, two major options stand out: Google Cloud Bigtable and Amazon DynamoDB. While both are managed, highly available, and horizontally scalable, they are designed with different models and use cases in mind. The comparison covers the data model, query capabilities, and scalability of each service.
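
To make the data-model contrast concrete, here is a minimal sketch (not from the original post) that writes one record to each service. The project, instance, table, and key names are hypothetical placeholders, and it assumes the boto3 and google-cloud-bigtable client libraries are installed.

```python
import boto3
from google.cloud import bigtable

# DynamoDB: items are addressed by a declared partition (and optional sort) key.
dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("user_events")  # hypothetical table name
events.put_item(Item={
    "user_id": "u-123",                  # partition key
    "event_ts": "2024-01-01T00:00:00Z",  # sort key
    "action": "login",
})

# Bigtable: everything hangs off a single row key; values live in column families.
bt_client = bigtable.Client(project="my-project")  # hypothetical project
bt_table = bt_client.instance("my-instance").table("user_events")
row = bt_table.direct_row(b"u-123#2024-01-01T00:00:00Z")  # composite row key
row.set_cell("events", b"action", b"login")
row.commit()
```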

How to Keep a Docker Container Running Persistently

When working with Docker, you may have noticed that some containers stop as soon as you exit the shell. This happens because Docker keeps a container alive only as long as its main process (PID 1) is running: when an interactive shell is that main process, exiting it stops the container. In this post, we will explain why this happens and how to keep your container running persistently, so you can reconnect whenever you need.
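
As a quick illustration of the fix, here is a hedged sketch using the Docker SDK for Python (pip install docker): give the container a command that never exits, so the main process stays alive. The image and container name are placeholders.

```python
import docker

client = docker.from_env()

# "tail -f /dev/null" (or "sleep infinity") never exits, so PID 1 stays
# alive and the container keeps running after you disconnect.
container = client.containers.run(
    "ubuntu:22.04",
    command=["tail", "-f", "/dev/null"],
    name="keepalive-demo",  # hypothetical name
    detach=True,
)

print(container.status)  # e.g. "created" or "running"
# Reattach a shell later with: docker exec -it keepalive-demo bash
```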

Orchestrating Multiple AWS Glue Workflows: A Practical Guide

AWS Glue provides a robust environment for building and managing ETL pipelines, but many data engineers face the challenge of chaining or coordinating multiple workflows. This article explores practical approaches for linking two or more Glue workflows, covering both native features and complementary AWS services. Why You Might Need Multiple Workflows In many data engineering
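
One simple approach, sketched below with boto3, is to start the upstream workflow, poll its run status, and launch the downstream workflow only on success. The workflow names are hypothetical, and this is one possible pattern rather than necessarily the method the full article settles on.

```python
import time
import boto3

glue = boto3.client("glue")
UPSTREAM, DOWNSTREAM = "wf-ingest", "wf-transform"  # hypothetical workflow names

# Kick off the upstream workflow and remember its run id.
run_id = glue.start_workflow_run(Name=UPSTREAM)["RunId"]

# Poll until the run reaches a terminal state.
while True:
    status = glue.get_workflow_run(Name=UPSTREAM, RunId=run_id)["Run"]["Status"]
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)

# Chain: only launch the downstream workflow if the upstream succeeded.
if status == "COMPLETED":
    glue.start_workflow_run(Name=DOWNSTREAM)
```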

Managing Evolving Schemas in Apache Spark: A Strategic Approach

Schema management is one of the most overlooked yet critical aspects of building reliable data pipelines. In a fast-moving environment, schemas rarely remain static: new fields are added, data types evolve, and nested structures become more complex. Relying on hard-coded schemas within Spark jobs may seem convenient at first, but it quickly turns into a
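
As a flavor of what externalized schema management can look like, here is a minimal PySpark sketch, assuming a versioned schema file and Parquet data. The paths are hypothetical, and this is one possible pattern, not the post's definitive approach.

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("schema-evolution-demo").getOrCreate()

# Load the expected schema from a versioned file (hypothetical path)
# instead of hard-coding a StructType in the job.
with open("schemas/events_v2.json") as f:
    expected = StructType.fromJson(json.load(f))

# mergeSchema reconciles Parquet files written with older or newer schemas.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3://my-bucket/events/"))  # hypothetical path

# Surface drift early instead of failing deep inside a transformation.
missing = set(expected.fieldNames()) - set(df.schema.fieldNames())
if missing:
    raise ValueError(f"Schema drift: missing columns {missing}")
```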

Secure Ways to Share Private Data on AWS: Beyond Public Buckets

When building data platforms in the cloud, it is common to share data with partners, clients, or internal teams other than your own. AWS provides several mechanisms to grant secure, granular access, going far beyond the simple (and risky) “make the bucket public” approach. In this post, we will explore the main strategies for sharing data
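
One of the mechanisms in this space is a time-limited presigned URL, sketched below with boto3; the bucket and key are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "exports/report.parquet"},
    ExpiresIn=3600,  # the link is valid for one hour, then access expires
)
print(url)  # share this URL instead of opening the bucket to the world
```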

Fixing Cursor Login Issues on Linux (AppImage)

When running Cursor on Linux, especially with the AppImage version, you might encounter a situation where you can’t log in. This usually happens because Cursor stores its session state locally, and sometimes that state gets corrupted. In this article, we’ll walk through how to diagnose the issue and reset your session state without losing your
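
A cautious way to reset local state, sketched below, is to move the state directory aside rather than delete it, so the app recreates it on the next launch. The ~/.config/Cursor location is an assumption based on the usual Electron layout on Linux, so verify it on your machine.

```python
import shutil
import time
from pathlib import Path

state_dir = Path.home() / ".config" / "Cursor"  # assumed Electron state location
if state_dir.exists():
    # Move instead of delete, so you can restore the old state if needed.
    backup = state_dir.with_name(f"Cursor.bak-{int(time.time())}")
    shutil.move(str(state_dir), str(backup))
    print(f"Moved session state to {backup}; launch Cursor and log in again.")
else:
    print("No local state found at", state_dir)
```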

Querying JSONB in PostgreSQL Efficiently

In modern applications, it is common to store semi-structured data in JSON format inside a relational database like PostgreSQL. However, to analyze this data properly, you need a way to transform it into a tabular structure that can be queried with standard SQL. In this article, we will demonstrate a real-world example of reading a
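
As a taste of the pattern, here is a small sketch (table and column names are hypothetical) that flattens a JSONB array into rows with standard SQL, run from Python via psycopg2.

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # jsonb_array_elements expands a JSONB array into one row per element;
    # ->> extracts a field as text, which we then cast as needed.
    cur.execute("""
        SELECT o.id,
               item ->> 'sku'          AS sku,
               (item ->> 'qty')::int   AS qty
        FROM orders AS o,
             jsonb_array_elements(o.payload -> 'items') AS item
    """)
    for row in cur.fetchall():
        print(row)
```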

Designing a Semantic Layer for Athena + Power BI

Modern data architectures benefit from a clear separation of layers: Ingestion, Staging, and Semantic (Presentation). When using Amazon Athena as the query engine and Power BI as the visualization tool, this layered approach enables scalability, governance, and cost control. 1. Ingestion (Raw Layer) Purpose: Store data exactly as it arrives from source systems, preserving fidelity.
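
To make the semantic layer concrete, here is a minimal sketch (all names and the S3 output location are hypothetical) that publishes a curated Athena view over a staging table; Power BI would then connect to this view rather than to the raw data.

```python
import boto3

athena = boto3.client("athena")

# A curated, business-friendly view: this is the surface Power BI sees.
create_view = """
CREATE OR REPLACE VIEW semantic.sales_summary AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount
FROM staging.sales
GROUP BY order_date, region
"""

athena.start_query_execution(
    QueryString=create_view,
    QueryExecutionContext={"Database": "semantic"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```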