Geek Logbook

Tech sea log book

HDFS vs. Object Storage: The Battle for Distributed Storage

Distributed storage has always been the foundation of Big Data. In the early days, Hadoop Distributed File System (HDFS) was the de facto standard. Today, however, object storage systems like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and MinIO are taking over. This shift reflects a broader change in how organizations

What Is a Data Lake and What Is a Data Lakehouse?

Over the last decade, the world of data architecture has gone through several transformations. From traditional data warehouses to Hadoop-based data lakes and now to the emerging Lakehouse paradigm, each stage represents a response to new challenges in scale, cost, and flexibility. But what exactly is a Data Lake, and how does a Data Lakehouse

The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for different reasons. Understanding their journey is essential to see how we arrived at today’s Data Lakehouse architectures. Hive (2008): SQL on

Incremental Data Loads: Choosing Between resource_version and created_at/updated_at

Incremental data loading is a cornerstone of modern data engineering pipelines. Instead of re-ingesting entire datasets on each execution, incremental strategies focus on retrieving only records that are new or modified since the last load. This approach reduces latency, improves efficiency, and lowers infrastructure costs. When designing incremental loads, a common dilemma arises: should the

The Enduring Relevance of Peter Chen’s Entity-Relationship Model

In the landscape of data modeling, few contributions have had the long-lasting impact of Peter Chen’s Entity-Relationship (E-R) Model, introduced in 1976. More than four decades later, it remains a foundational framework for conceptualizing and designing data systems—bridging the gap between abstract business understanding and concrete database implementation. A Unified View of Data Chen’s model

How Hadoop Made Specialized Storage Hardware Obsolete

In the early 2000s, enterprise data processing was dominated by high-end hardware. Organizations relied heavily on centralized storage systems such as SAN (Storage Area Networks) and NAS (Network Attached Storage), typically connected to symmetric multiprocessing (SMP) servers or high-performance computing (HPC) clusters. These environments were expensive to scale, difficult to manage, and designed to avoid

EMR vs AWS Glue: Choosing the Right Data Processing Tool on AWS

When working with big data on AWS, two commonly used services for data processing are Amazon EMR and AWS Glue. Although both support scalable data transformation and analytics, they differ significantly in architecture, control, use cases, and cost models. Choosing the right tool depends on your specific workload, performance needs, and operational preferences. In this

How Google Changed Big Data: The Story of GFS, MapReduce, and Bigtable

In the early 2000s, Google faced a unique challenge: how to store, process, and query massive amounts of data across thousands of unreliable machines. The traditional systems of the time—designed for a world of smaller datasets and centralized infrastructure—simply couldn’t keep up. Google responded by designing an entirely new architecture. It wasn’t just about solving

Secure Database Access in AWS Using SSH Tunneling

Accessing databases located in private subnets within AWS Virtual Private Clouds (VPCs) is a common requirement in enterprise architectures. To ensure secure connectivity without exposing the database to the public internet, developers and operations engineers often employ SSH tunneling via a bastion host. Background Databases in a private subnet cannot be accessed directly from external