Geek Logbook

Tech sea log book

Controlling Branch Deployments and Redirects in Vercel: A Practical Guide

Continuous deployment platforms simplify the release process, but they can easily become noisy when every branch triggers a build. Teams working with multiple development environments often need finer control — building only when specific branches are updated and ignoring the rest. The Problem: Imagine a development team maintaining three main branches: … By default, Vercel automatically…

Estimating the Cost of an AWS Glue Workflow

When working with AWS Glue, one of the most common questions data engineers ask is: how much will this job cost me? If you have a workflow that runs for 13 minutes, understanding the cost model of AWS Glue helps you avoid surprises on your AWS bill. How AWS Glue Pricing Works: AWS Glue pricing…
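The back-of-the-envelope math behind that question can be sketched in a few lines. This assumes the list price of $0.44 per DPU-hour (us-east-1, Spark jobs) and the per-second billing with a 1-minute minimum introduced in Glue 2.0+; check current AWS pricing for your region before relying on the numbers.

```python
# Sketch: estimate the cost of one AWS Glue Spark job run.
# Assumed rates -- verify against current AWS pricing for your region:
#   - $0.44 per DPU-hour (us-east-1 list price for Spark jobs)
#   - per-second billing with a 1-minute minimum (Glue 2.0+)

PRICE_PER_DPU_HOUR = 0.44   # USD, assumed us-east-1 rate
MIN_BILLED_SECONDS = 60     # 1-minute billing minimum

def glue_job_cost(runtime_seconds: int, dpus: int) -> float:
    """Estimated USD cost of a single job run."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    dpu_hours = dpus * billed_seconds / 3600
    return dpu_hours * PRICE_PER_DPU_HOUR

# Example: the 13-minute workflow from the post, on a hypothetical 10 DPUs
cost = glue_job_cost(runtime_seconds=13 * 60, dpus=10)
print(f"${cost:.2f}")  # roughly $0.95 under these assumptions
```

The 10-DPU figure is only an illustration (it is the default for many Glue Spark jobs); the real cost scales linearly with whatever DPU count the workflow actually uses.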

AWS EventBridge Rules vs EventBridge Scheduler: Which One Should You Use?

In the AWS ecosystem, there are two main ways to schedule and automate tasks: EventBridge Rules (scheduled rules) and the newer EventBridge Scheduler, which introduces Schedule Groups. While both can trigger actions at defined times, their design, scalability, and flexibility differ significantly. Choosing the right option depends on your workload requirements. 1. What Are EventBridge…

Running Production Servers on AWS: EC2 vs RDS Cost Breakdown

When planning to run production workloads in the cloud, cost is one of the most important considerations. In this post, we will explore the monthly expenses of running two application servers and a database server on AWS, and compare two deployment approaches: EC2-only vs EC2 + RDS. Infrastructure Requirements: Our baseline infrastructure looks like this: …

Modern Table Formats: Iceberg, Delta Lake, and Hudi

Data Lakes made it possible to store raw data at scale, but they lacked the reliability and governance of data warehouses. Files could be dropped into storage (S3, HDFS, MinIO), but analysts struggled with schema changes, updates, and deletes. To solve these issues, the community created modern table formats that brought ACID transactions, schema evolution, …

Trino in Modern Architectures: SQL Queries on S3 and MinIO

The rise of cloud object storage has transformed how organizations build data platforms. Hadoop Distributed File System (HDFS) once dominated, but today services like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and on-premise solutions like MinIO are the new foundation. In this shift, Trino has emerged as the query engine of…

Hive Metastore: The Glue Holding Big Data Together

When people think of Hive, they often remember the early days of Hadoop and MapReduce. But while Hive as a query engine has largely faded, one of its components remains critical to the modern data ecosystem: the Hive Metastore. This metadata service has become the backbone of Big Data platforms, powering not just Hive itself…

Why Parquet Became the Standard for Analytics

In the early days of Big Data, data was often stored in simple formats such as CSV, JSON, or text logs. While these formats were easy to generate and understand, they quickly became inefficient at scale. The analytics community needed a storage format that could reduce costs, improve query performance, and work across a diverse…

HDFS vs. Object Storage: The Battle for Distributed Storage

Distributed storage has always been the foundation of Big Data. In the early days, Hadoop Distributed File System (HDFS) was the de facto standard. Today, however, object storage systems like Amazon S3, Google Cloud Storage (GCS), Azure Data Lake Storage (ADLS), and MinIO are taking over. This shift reflects a broader change in how organizations…

What Is a Data Lake and What Is a Data Lakehouse?

Over the last decade, the world of data architecture has gone through several transformations. From traditional data warehouses to Hadoop-based data lakes and now to the emerging Lakehouse paradigm, each stage represents a response to new challenges in scale, cost, and flexibility. But what exactly is a Data Lake, and how does a Data Lakehouse…