Geek Logbook

Tech sea log book

How to Disable an AWS Glue Trigger from the CLI

When working with AWS Glue, triggers are an important mechanism to orchestrate jobs or workflows. Sometimes, however, you may need to temporarily disable a trigger without deleting it—for example, to pause scheduled ingestions during maintenance or testing. This article explains how to disable a trigger using the AWS CLI. Understanding AWS Glue Triggers AWS Glue

Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode

When working with PySpark, one of the first commands developers use to quickly inspect data is DataFrame .show(). However, in certain environments (especially when running inside PyCharm or VSCode with a debugger), you may encounter a timeout warning. At first glance, this message looks like an error in Spark itself, but in reality it is

Incremental Data Loads: Choosing Between resource_version and created_at/updated_at

Incremental data loading is a cornerstone of modern data engineering pipelines. Instead of re-ingesting entire datasets on each execution, incremental strategies focus on retrieving only records that are new or modified since the last load. This approach reduces latency, improves efficiency, and lowers infrastructure costs. When designing incremental loads, a common dilemma arises: should the

Running Apache Airflow Across Environments

Apache Airflow has become a de facto standard for orchestrating data workflows. However, depending on the environment, the way Airflow runs can change significantly. Many teams get confused when moving between managed cloud services, local setups, and containerized deployments. This post provides a clear comparison of how Airflow operates in different contexts: 1. Airflow on