Geek Logbook

Tech sea log book

Understanding Window Functions in SQL: Beyond Simple Aggregations

When we think about SQL functions, we often start with scalar functions (UPPER(), ROUND(), NOW()) or aggregate functions (SUM(), AVG(), COUNT()). But there is a third type that is essential for advanced analytics: window functions. The “Window”: The Metaphor Behind the Concept A window function is evaluated for every row, but not in isolation —
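As a rough illustration of "evaluated for every row, but not in isolation", here is a minimal sketch that runs a window function through Python's built-in sqlite3 module. The `sales` table, its columns, and the values are hypothetical and used only for this example, and the sketch assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 30), ("south", 20), ("south", 40)],
)

# SUM(...) OVER (PARTITION BY ...) keeps every input row and attaches the
# per-region total to it; a GROUP BY would collapse each region to one row.
rows = conn.execute(
    """
    SELECT region,
           amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
    """
).fetchall()

for row in rows:
    print(row)
```

All four input rows come back, each carrying the total of its own region, which is the difference from a plain aggregate.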

How to Set CloudWatch Log Retention Policies with Terraform

AWS CloudWatch is a powerful service for monitoring applications and infrastructure. However, by default, CloudWatch Logs are configured to never expire. This can lead to excessive storage costs and retention of data that you may not need. A better approach is to define a retention policy that aligns with your operational and compliance requirements. In

Orchestrating Multiple AWS Glue Workflows with Step Functions

In modern data architectures, it is common to manage multiple ETL pipelines that must run in sequence or in parallel. AWS Glue provides a robust framework for building workflows, but when we need to orchestrate two or more Glue Workflows together, AWS Step Functions becomes the natural choice. In this post, we will explain how

Understanding the Strategy Design Pattern

In the landscape of software design, maintaining flexibility and scalability is crucial. One of the most effective ways to achieve these qualities is by leveraging design patterns. Among the behavioral design patterns, the Strategy Pattern stands out as a powerful tool to manage algorithms dynamically. What is the Strategy Pattern? The Strategy Pattern allows you
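To make "managing algorithms dynamically" concrete, here is a minimal sketch of the pattern in Python. The discount scenario and every name in it (`DiscountStrategy`, `NoDiscount`, `TenPercentOff`, `Checkout`) are hypothetical, chosen for illustration rather than taken from the post.

```python
from abc import ABC, abstractmethod


class DiscountStrategy(ABC):
    """Common interface that every interchangeable algorithm implements."""

    @abstractmethod
    def apply(self, cents: int) -> int:
        """Return the price in cents after applying the discount."""


class NoDiscount(DiscountStrategy):
    def apply(self, cents: int) -> int:
        return cents


class TenPercentOff(DiscountStrategy):
    def apply(self, cents: int) -> int:
        # Integer arithmetic on cents avoids floating-point rounding.
        return cents - cents // 10


class Checkout:
    """Context object: delegates the calculation to its current strategy."""

    def __init__(self, strategy: DiscountStrategy) -> None:
        self.strategy = strategy

    def total(self, cents: int) -> int:
        return self.strategy.apply(cents)


checkout = Checkout(NoDiscount())
full_price = checkout.total(20_000)

# Swap the algorithm at runtime without touching Checkout itself.
checkout.strategy = TenPercentOff()
discounted = checkout.total(20_000)

print(full_price, discounted)
```

The context never branches on which discount is active; adding a new rule means adding a class, not editing `Checkout`.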

How to Disable an AWS Glue Trigger from the CLI

When working with AWS Glue, triggers are an important mechanism to orchestrate jobs or workflows. Sometimes, however, you may need to temporarily disable a trigger without deleting it—for example, to pause scheduled ingestions during maintenance or testing. This article explains how to disable a trigger using the AWS CLI. Understanding AWS Glue Triggers AWS Glue

Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode

When working with PySpark, one of the first commands developers use to quickly inspect data is .show(). However, in certain environments (especially when running inside PyCharm or VSCode with a debugger), you may encounter a timeout warning. At first glance, this message looks like an error in Spark itself, but in reality it is

Incremental Data Loads: Choosing Between resource_version and created_at/updated_at

Incremental data loading is a cornerstone of modern data engineering pipelines. Instead of re-ingesting entire datasets on each execution, incremental strategies focus on retrieving only records that are new or modified since the last load. This approach reduces latency, improves efficiency, and lowers infrastructure costs. When designing incremental loads, a common dilemma arises: should the

Running Apache Airflow Across Environments

Apache Airflow has become a de facto standard for orchestrating data workflows. However, depending on the environment, the way Airflow runs can change significantly. Many teams get confused when moving between managed cloud services, local setups, and containerized deployments. This post provides a clear comparison of how Airflow operates in different contexts: 1. Airflow on