Geek Logbook

Tech sea log book

Sending Events to Multiple PostHog Projects from the Same Website

In some architectures, a single website needs to send analytics events to multiple PostHog projects. PostHog supports this setup by allowing multiple instances of the JavaScript SDK to be initialized and run simultaneously on the same website…

Lambda vs n8n: A Simple Explanation for Data Workflows

Introduction When building data systems or integrating APIs, a common question appears: should we use AWS Lambda or n8n? Both tools can automate processes, call APIs, and move data between systems, but they are not the same thing and should not be used for the same purpose. The simplest way to understand the difference is…

Should You Use AWS Lambda or AWS Glue to Update Records in HubSpot?

When integrating HubSpot with a data platform on AWS, a common architectural decision appears quickly: should updates to HubSpot be executed from AWS Lambda or from AWS Glue? The correct choice depends on workload characteristics, latency requirements, and system design principles. This article explains the decision from an architectural and data engineering perspective…

Can You Know the Location of an IPv6 Address?

Short answer: only approximately, and with significant limitations. This article explains what can and cannot be inferred from an IPv6 address, the technical reasons behind those limitations, and how geolocation services actually work. An IPv6 address is 128 bits long…

HDFS vs. Object Storage: The Battle for Distributed Storage

Distributed storage has always been the foundation of Big Data. In the early days, the Hadoop Distributed File System (HDFS) was the de facto standard. Today, however, object storage systems like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and MinIO are taking over. This shift reflects a broader change in how organizations…

What Is a Data Lake and What Is a Data Lakehouse?

Over the last decade, the world of data architecture has gone through several transformations. From traditional data warehouses to Hadoop-based data lakes and now to the emerging Lakehouse paradigm, each stage represents a response to new challenges in scale, cost, and flexibility. But what exactly is a Data Lake, and how does a Data Lakehouse differ?

The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for different reasons. Understanding their journey is essential to see how we arrived at today’s Data Lakehouse architectures…

Incremental Data Loads: Choosing Between resource_version and created_at/updated_at

Incremental data loading is a cornerstone of modern data engineering pipelines. Instead of re-ingesting entire datasets on each execution, incremental strategies focus on retrieving only records that are new or modified since the last load. This approach reduces latency, improves efficiency, and lowers infrastructure costs. When designing incremental loads, a common dilemma arises: should the pipeline track changes with a resource_version field or with created_at/updated_at timestamps?

The Enduring Relevance of Peter Chen’s Entity-Relationship Model

In the landscape of data modeling, few contributions have had the long-lasting impact of Peter Chen’s Entity-Relationship (E-R) Model, introduced in 1976. More than four decades later, it remains a foundational framework for conceptualizing and designing data systems—bridging the gap between abstract business understanding and concrete database implementation…

How Hadoop Made Specialized Storage Hardware Obsolete

In the early 2000s, enterprise data processing was dominated by high-end hardware. Organizations relied heavily on centralized storage systems such as SAN (Storage Area Networks) and NAS (Network Attached Storage), typically connected to symmetric multiprocessing (SMP) servers or high-performance computing (HPC) clusters. These environments were expensive to scale, difficult to manage, and designed to avoid…