Geek Logbook

Tech sea log book

Hardening OAuth Token Management in Postman: Preventing Environment Cross-Contamination

When working with multiple third-party APIs (Zoom, HubSpot, Meta, etc.), a common operational risk in Postman is environment cross-contamination: tokens may be overwritten unintentionally if the wrong environment is active. This article describes a controlled, production-grade approach to managing OAuth tokens safely in Postman. The Core Problem: if all environments share a variable with the same name …

AWS Glue + Chargebee: Diagnosing CERTIFICATE_VERIFY_FAILED After TLS Chain Updates

Context: an AWS Glue job that consumes the Chargebee API begins failing with CERTIFICATE_VERIFY_FAILED, while the same request works in Postman. This pattern typically appears after a certificate chain rotation on the API provider side combined with an outdated trust store in the execution environment. Chargebee announced updates related to its TLS certificate chain (DigiCert G2 becoming …
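The diagnosis in this post hinges on a simple asymmetry: Postman ships its own trust store, while a Python runtime in Glue verifies certificates against whatever CA bundle it is given. A minimal sketch of the usual remediation, pointing the TLS context at an updated bundle, is shown below (the `ca_bundle_path` parameter and the bundle location are illustrative assumptions, not a specific Glue API):

```python
import ssl

def make_verified_context(ca_bundle_path=None):
    """Build a TLS context; optionally trust an updated CA bundle when the
    runtime's default trust store predates the provider's new chain."""
    ctx = ssl.create_default_context()
    if ca_bundle_path:
        # e.g. a refreshed certifi bundle packaged with the Glue job
        ctx.load_verify_locations(cafile=ca_bundle_path)
    return ctx

ctx = make_verified_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # verification stays enabled
```

The point of the sketch is that the fix is to update the trust store, never to disable verification.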

Why You Can’t Get Full Social Analytics from the HubSpot API (Even with Marketing Hub Pro)

Many teams assume that upgrading to Marketing Hub Professional unlocks full programmatic access to social media performance metrics. It does not. This article clarifies what is technically possible, what is not, and how to architect a reliable data pipeline for social analytics. The Core Limitation: HubSpot allows you to …, but it does not provide an …

Hiding Personal Information in AWS Glue with Spark

Protecting personal data before analytics consumption is a core requirement in modern data platforms. In AWS-based lake architectures, this is typically achieved through data de-identification during ingestion or transformation. This post outlines a practical and production-ready approach to hiding personal information using Spark jobs in AWS Glue. What “Hide Personal Information” Means in Data Engineering …
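In a Glue job, de-identification is usually expressed with Spark column functions such as `sha2` applied to selected columns. The core idea — deterministic, irreversible masking of the PII fields only — can be sketched with the standard library (the field names and the inline salt are illustrative; a real pipeline would pull the salt from a secrets manager):

```python
import hashlib

def mask_value(value: str, salt: str = "pipeline-secret") -> str:
    """Deterministic, irreversible masking: the same input always yields
    the same token, so joins on masked columns still work."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Illustrative record: only fields flagged as PII are masked.
record = {"user_id": "42", "email": "ana@example.com", "plan": "pro"}
PII_FIELDS = {"email"}
masked = {k: mask_value(v) if k in PII_FIELDS else v
          for k, v in record.items()}
```

Determinism is the design choice to notice: hashing (unlike random redaction) preserves referential integrity across tables while remaining non-reversible.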

Modern Table Formats: Iceberg, Delta Lake, and Hudi

Data Lakes made it possible to store raw data at scale, but they lacked the reliability and governance of data warehouses. Files could be dropped into storage (S3, HDFS, MinIO), but analysts struggled with schema changes, updates, and deletes. To solve these issues, the community created modern table formats that brought ACID transactions, schema evolution, …

Trino in Modern Architectures: SQL Queries on S3 and MinIO

The rise of cloud object storage has transformed how organizations build data platforms. Hadoop Distributed File System (HDFS) once dominated, but today services like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and on-premise solutions like MinIO are the new foundation. In this shift, Trino has emerged as the query engine of …

Hive Metastore: The Glue Holding Big Data Together

When people think of Hive, they often remember the early days of Hadoop and MapReduce. But while Hive as a query engine has largely faded, one of its components remains critical to the modern data ecosystem: the Hive Metastore. This metadata service has become the backbone of Big Data platforms, powering not just Hive itself …

Why Parquet Became the Standard for Analytics

In the early days of Big Data, data was often stored in simple formats such as CSV, JSON, or text logs. While these formats were easy to generate and understand, they quickly became inefficient at scale. The analytics community needed a storage format that could reduce costs, improve query performance, and work across a diverse …

Managing Evolving Schemas in Apache Spark: A Strategic Approach

Schema management is one of the most overlooked yet critical aspects of building reliable data pipelines. In a fast-moving environment, schemas rarely remain static: new fields are added, data types evolve, and nested structures become more complex. Relying on hard-coded schemas within Spark jobs may seem convenient at first, but it quickly turns into a …
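The evolution problem the post describes — new fields appearing across batches, with incompatible type changes needing to fail loudly — can be sketched as a plain-Python schema merge (field names and the dict-based schema representation are illustrative; in Spark itself this corresponds to features like the `mergeSchema` option when reading Parquet):

```python
def merge_schemas(batches):
    """Union of field -> type across batches, rejecting type conflicts
    (a simplified analogue of Spark's schema merging)."""
    merged = {}
    for batch in batches:
        for field, dtype in batch.items():
            if field in merged and merged[field] != dtype:
                raise TypeError(
                    f"conflict on {field}: {merged[field]} vs {dtype}")
            merged[field] = dtype
    return merged

v1 = {"id": "long", "name": "string"}
v2 = {"id": "long", "name": "string", "email": "string"}  # field added later
print(merge_schemas([v1, v2]))
```

Additive changes merge silently, while a type change on an existing field raises, which is usually the behavior you want at pipeline boundaries.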