Geek Logbook

Tech sea log book

Understanding Window Functions in SQL: Beyond Simple Aggregations

When we think about SQL functions, we often start with scalar functions (UPPER(), ROUND(), NOW()) or aggregate functions (SUM(), AVG(), COUNT()). But there is a third type that is essential for advanced analytics: window functions. The “Window”: The Metaphor Behind the Concept. A window function is evaluated for every row, but not in isolation —
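The contrast the teaser draws can be seen in a minimal sketch using Python's built-in sqlite3 module (SQLite supports window functions since version 3.25); the sales table and its rows are made up for illustration:

```python
import sqlite3

# In-memory database with a small, made-up sales table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100), ("north", 300), ("south", 200)],
)

# Unlike SUM() with GROUP BY, which collapses each group to one row,
# the window version keeps every row and attaches the per-region total.
rows = conn.execute(
    """
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
    """
).fetchall()

for row in rows:
    print(row)
# ('north', 100, 400)
# ('north', 300, 400)
# ('south', 200, 200)
```

Note that all three input rows survive: the window is the set of rows sharing the same region, and the aggregate is computed over that window for each row.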

Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode

When working with PySpark, one of the first commands developers use to quickly inspect data is `.show()`. However, in certain environments (especially when running inside PyCharm or VSCode with a debugger), you may encounter a timeout warning. At first glance, this message looks like an error in Spark itself, but in reality it is

Running Apache Airflow Across Environments

Apache Airflow has become a de facto standard for orchestrating data workflows. However, depending on the environment, the way Airflow runs can change significantly. Many teams get confused when moving between managed cloud services, local setups, and containerized deployments. This post provides a clear comparison of how Airflow operates in different contexts: 1. Airflow on

Optimizing Partition Strategies in Apache Iceberg on AWS

When working with large-scale analytical datasets, efficient partitioning is critical for achieving optimal query performance and cost savings. Apache Iceberg, a modern table format designed for big data, offers powerful partitioning capabilities. One common design decision is whether to use a single date column (e.g., yyyymmdd) or separate columns for year, month, and day (year,
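The two candidate layouts the teaser mentions can be sketched in plain Python; the `partition_values` helper below is hypothetical (in Iceberg itself these would be expressed as partition transforms on a timestamp column), and serves only to make the single-column vs. multi-column trade-off concrete:

```python
from datetime import date

def partition_values(d: date) -> dict:
    """Hypothetical helper: derive both candidate partition layouts
    from a single event date, for comparison."""
    return {
        # Single-column layout: one yyyymmdd value identifies a partition.
        "yyyymmdd": d.strftime("%Y%m%d"),
        # Multi-column layout: separate year/month/day partition columns,
        # which let queries prune at coarser granularities (e.g. whole years).
        "year": d.year,
        "month": d.month,
        "day": d.day,
    }

print(partition_values(date(2024, 3, 7)))
# {'yyyymmdd': '20240307', 'year': 2024, 'month': 3, 'day': 7}
```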

How Transactions Work in Databricks Using Delta Lake

Databricks is a powerful platform for big data analytics and machine learning. One of its key features is the ability to run transactional workloads over large-scale data lakes using Delta Lake. This post explores how transactions are supported in Databricks and how you can use them to ensure data consistency and integrity. What Are Transactions

AWS Glue Workflow vs Apache Airflow: A Professional Comparison

While both serve the common purpose of managing and automating data workflows, they differ significantly in architecture, flexibility, integration capabilities, and operational control. This article offers a comprehensive and professional comparison of AWS Glue Workflow and Apache Airflow to help data engineers, architects, and decision-makers choose the most suitable tool for their use case. 1.

Reducing AWS Costs: How to Temporarily Stop an Aurora Serverless v2 Cluster

When managing cloud infrastructure, minimizing costs without compromising data integrity is a continuous priority. Amazon Aurora Serverless v2 offers scalability and high availability, but unlike traditional RDS instances, it introduces nuances in how compute resources are billed. One common question arises: Can an Aurora Serverless v2 database be stopped to save costs? Understanding Aurora Serverless

The Origin and Evolution of the DataFrame

When working with data today—whether in Python, R, or distributed computing platforms like Spark—one of the most commonly used structures is the DataFrame. But where did it come from? This post explores the origin, evolution, and growing importance of the DataFrame in data science and analytics. What is a DataFrame? A DataFrame is a two-dimensional

Are NoSQL Databases Really Schema-less?

A Perspective from the MERN Stack. When we first start learning about NoSQL databases, one of the most common things we hear is that they are “schema-less.” At first glance, this seems like a huge advantage: total flexibility, the ability to adapt quickly, and storage that isn’t bound by strict rules. But when we dive
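The flexibility the teaser describes can be illustrated with a small, database-free sketch (the "collection" here is just a Python list standing in for something like a MongoDB collection): the store accepts documents of any shape, but the code that reads them still assumes certain fields exist, i.e. an implicit schema:

```python
# Made-up "collection": documents with inconsistent shapes are accepted,
# because nothing enforces a structure at write time.
docs = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Alan"},  # missing "email": the store won't complain
]

# ...but application code still encodes an implicit schema: it expects
# an "email" field and must decide what to do when it is absent.
emails = [d.get("email", "<missing>") for d in docs]
print(emails)
# ['ada@example.com', '<missing>']
```

This is the usual sense in which "schema-less" really means "schema-on-read": the schema moves from the database into the application.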