Geek Logbook

Tech sea log book

The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for different reasons. Understanding their journey is essential to see how we arrived at today’s Data Lakehouse architectures. Hive (2008): SQL on

Google Bigtable vs. Amazon DynamoDB: Understanding the Differences

When choosing a NoSQL database for scalable, low-latency applications, two major options stand out: Google Cloud Bigtable and Amazon DynamoDB. While both are managed, highly available, and horizontally scalable, they are designed with different models and use cases in mind. The comparison covers the data model, query capabilities, and scalability of each service.
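
To make the data-model contrast concrete, here is a minimal sketch (not from the original post) that writes one record to each service. The project, instance, table, and key names are hypothetical placeholders, and it assumes the boto3 and google-cloud-bigtable client libraries are installed.

```python
import boto3
from google.cloud import bigtable

# DynamoDB: items are addressed by a declared partition (and optional sort) key.
dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("user_events")  # hypothetical table name
events.put_item(Item={
    "user_id": "u-123",                  # partition key
    "event_ts": "2024-01-01T00:00:00Z",  # sort key
    "action": "login",
})

# Bigtable: everything hangs off a single row key; values live in column families.
bt_client = bigtable.Client(project="my-project")  # hypothetical project
bt_table = bt_client.instance("my-instance").table("user_events")
row = bt_table.direct_row(b"u-123#2024-01-01T00:00:00Z")  # composite row key
row.set_cell("events", b"action", b"login")
row.commit()
```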

How to Keep a Docker Container Running Persistently

When working with Docker, you may have noticed that some containers stop as soon as you exit the shell. This happens because Docker keeps a container alive only as long as its main process (PID 1) is running: when an interactive shell is that main process, exiting it stops the container. In this post, we will explain why this happens and how to keep your container running persistently, so you can reconnect whenever you need.
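
As a quick illustration of the fix, here is a hedged sketch using the Docker SDK for Python (pip install docker): give the container a command that never exits, so the main process stays alive. The image and container name are placeholders.

```python
import docker

client = docker.from_env()

# "tail -f /dev/null" (or "sleep infinity") never exits, so PID 1 stays
# alive and the container keeps running after you disconnect.
container = client.containers.run(
    "ubuntu:22.04",
    command=["tail", "-f", "/dev/null"],
    name="keepalive-demo",  # hypothetical name
    detach=True,
)

print(container.status)  # e.g. "created" or "running"
# Reattach a shell later with: docker exec -it keepalive-demo bash
```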

Orchestrating Multiple AWS Glue Workflows: A Practical Guide

AWS Glue provides a robust environment for building and managing ETL pipelines, but many data engineers face the challenge of chaining or coordinating multiple workflows. This article explores practical approaches for linking two or more Glue workflows, covering both native features and complementary AWS services. Why You Might Need Multiple Workflows In many data engineering
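
One simple approach, sketched below with boto3, is to start the upstream workflow, poll its run status, and launch the downstream workflow only on success. The workflow names are hypothetical, and this is one possible pattern rather than necessarily the method the full article settles on.

```python
import time
import boto3

glue = boto3.client("glue")
UPSTREAM, DOWNSTREAM = "wf-ingest", "wf-transform"  # hypothetical workflow names

# Kick off the upstream workflow and remember its run id.
run_id = glue.start_workflow_run(Name=UPSTREAM)["RunId"]

# Poll until the run reaches a terminal state.
while True:
    status = glue.get_workflow_run(Name=UPSTREAM, RunId=run_id)["Run"]["Status"]
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)

# Chain: only launch the downstream workflow if the upstream succeeded.
if status == "COMPLETED":
    glue.start_workflow_run(Name=DOWNSTREAM)
```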

Managing Evolving Schemas in Apache Spark: A Strategic Approach

Schema management is one of the most overlooked yet critical aspects of building reliable data pipelines. In a fast-moving environment, schemas rarely remain static: new fields are added, data types evolve, and nested structures become more complex. Relying on hard-coded schemas within Spark jobs may seem convenient at first, but it quickly turns into a
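
As a flavor of what externalized schema management can look like, here is a minimal PySpark sketch, assuming a versioned schema file and Parquet data. The paths are hypothetical, and this is one possible pattern, not the post's definitive approach.

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("schema-evolution-demo").getOrCreate()

# Load the expected schema from a versioned file (hypothetical path)
# instead of hard-coding a StructType in the job.
with open("schemas/events_v2.json") as f:
    expected = StructType.fromJson(json.load(f))

# mergeSchema reconciles Parquet files written with older or newer schemas.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3://my-bucket/events/"))  # hypothetical path

# Surface drift early instead of failing deep inside a transformation.
missing = set(expected.fieldNames()) - set(df.schema.fieldNames())
if missing:
    raise ValueError(f"Schema drift: missing columns {missing}")
```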

Secure Ways to Share Private Data on AWS: Beyond Public Buckets

When building data platforms in the cloud, it is common to share data with partners, clients, or internal teams other than your own. AWS provides several mechanisms to grant secure, granular access, going far beyond the simple (and risky) “make the bucket public” approach. In this post, we will explore the main strategies for sharing data
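
One of the mechanisms in this space is a time-limited presigned URL, sketched below with boto3; the bucket and key are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "exports/report.parquet"},
    ExpiresIn=3600,  # the link is valid for one hour, then access expires
)
print(url)  # share this URL instead of opening the bucket to the world
```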

Fixing Cursor Login Issues on Linux (AppImage)

When running Cursor on Linux, especially with the AppImage version, you might encounter a situation where you can’t log in. This usually happens because Cursor stores its session state locally, and sometimes that state gets corrupted. In this article, we’ll walk through how to diagnose the issue and reset your session state without losing your
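
A cautious way to reset local state, sketched below, is to move the state directory aside rather than delete it, so the app recreates it on the next launch. The ~/.config/Cursor location is an assumption based on the usual Electron layout on Linux, so verify it on your machine.

```python
import shutil
import time
from pathlib import Path

state_dir = Path.home() / ".config" / "Cursor"  # assumed Electron state location
if state_dir.exists():
    # Move instead of delete, so you can restore the old state if needed.
    backup = state_dir.with_name(f"Cursor.bak-{int(time.time())}")
    shutil.move(str(state_dir), str(backup))
    print(f"Moved session state to {backup}; launch Cursor and log in again.")
else:
    print("No local state found at", state_dir)
```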

Querying JSONB in PostgreSQL Efficiently

In modern applications, it is common to store semi-structured data in JSON format inside a relational database like PostgreSQL. However, to analyze this data properly, you need a way to transform it into a tabular structure that can be queried with standard SQL. In this article, we will demonstrate a real-world example of reading a
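
As a taste of the pattern, here is a small sketch (table and column names are hypothetical) that flattens a JSONB array into rows with standard SQL, run from Python via psycopg2.

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # jsonb_array_elements expands a JSONB array into one row per element;
    # ->> extracts a field as text, which we then cast as needed.
    cur.execute("""
        SELECT o.id,
               item ->> 'sku'          AS sku,
               (item ->> 'qty')::int   AS qty
        FROM orders AS o,
             jsonb_array_elements(o.payload -> 'items') AS item
    """)
    for row in cur.fetchall():
        print(row)
```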

Designing a Semantic Layer for Athena + Power BI

Modern data architectures benefit from a clear separation of layers: Ingestion, Staging, and Semantic (Presentation). When using Amazon Athena as the query engine and Power BI as the visualization tool, this layered approach enables scalability, governance, and cost control. 1. Ingestion (Raw Layer) Purpose: Store data exactly as it arrives from source systems, preserving fidelity.
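
To make the semantic layer concrete, here is a minimal sketch (all names and the S3 output location are hypothetical) that publishes a curated Athena view over a staging table; Power BI would then connect to this view rather than to the raw data.

```python
import boto3

athena = boto3.client("athena")

# A curated, business-friendly view: this is the surface Power BI sees.
create_view = """
CREATE OR REPLACE VIEW semantic.sales_summary AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount
FROM staging.sales
GROUP BY order_date, region
"""

athena.start_query_execution(
    QueryString=create_view,
    QueryExecutionContext={"Database": "semantic"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```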