Geek Logbook

Tech sea log book

Understanding the Strategy Design Pattern

In the landscape of software design, maintaining flexibility and scalability is crucial. One of the most effective ways to achieve these qualities is by leveraging design patterns. Among the behavioral design patterns, the Strategy Pattern stands out as a powerful tool for swapping algorithms at runtime. …
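A minimal sketch of the pattern the post introduces: interchangeable algorithm objects behind a common method, selected by a context at runtime. The discount-pricing domain and all class names here are illustrative assumptions, not taken from the article.

```python
class PercentageDiscount:
    """Strategy 1: take a percentage off the price."""
    def __init__(self, pct: float):
        self.pct = pct

    def apply(self, price: float) -> float:
        return price * (1 - self.pct)


class FlatDiscount:
    """Strategy 2: subtract a fixed amount, never going below zero."""
    def __init__(self, amount: float):
        self.amount = amount

    def apply(self, price: float) -> float:
        return max(price - self.amount, 0.0)


class Checkout:
    """Context: delegates pricing to whichever strategy it currently holds."""
    def __init__(self, strategy):
        self.strategy = strategy

    def total(self, price: float) -> float:
        return self.strategy.apply(price)


checkout = Checkout(PercentageDiscount(0.10))
print(checkout.total(100.0))  # 90.0

# Swap the algorithm at runtime without touching Checkout.
checkout.strategy = FlatDiscount(25.0)
print(checkout.total(100.0))  # 75.0
```

Because the context only depends on the shared `apply` interface, new pricing rules can be added without modifying existing code.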

How to Disable an AWS Glue Trigger from the CLI

When working with AWS Glue, triggers are an important mechanism for orchestrating jobs and workflows. Sometimes, however, you may need to temporarily disable a trigger without deleting it, for example to pause scheduled ingestions during maintenance or testing. This article explains how to disable a trigger using the AWS CLI. …
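The core of the workflow the post describes can be sketched with the Glue CLI's stop/start commands; the trigger name below is a placeholder, and the commands require configured AWS credentials.

```shell
# Deactivate the trigger without deleting it.
aws glue stop-trigger --name my-nightly-trigger

# Confirm the new state (expect DEACTIVATED).
aws glue get-trigger --name my-nightly-trigger --query 'Trigger.State'

# Re-enable it once maintenance or testing is done.
aws glue start-trigger --name my-nightly-trigger
```

Stopping a trigger preserves its schedule, conditions, and job associations, so re-enabling it restores the original orchestration unchanged.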

Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode

When working with PySpark, one of the first commands developers use to inspect data quickly is `df.show()`. In certain environments, however (especially when running inside PyCharm or VSCode with a debugger attached), this call can produce a timeout warning. At first glance the message looks like an error in Spark itself, but in reality it is …
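For reference, a minimal reproduction of the call the post discusses; it needs a local Spark runtime, and comparing a plain run against a debugger-attached run is where the timeout behavior shows up.

```python
from pyspark.sql import SparkSession

# Small local session; master/app name are illustrative.
spark = SparkSession.builder.master("local[1]").appName("show-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# .show() triggers an actual Spark job; under an IDE debugger this is
# the point where evaluation can stall and the timeout warning appears.
df.show()

spark.stop()
```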

Incremental Data Loads: Choosing Between resource_version and created_at/updated_at

Incremental data loading is a cornerstone of modern data engineering pipelines. Instead of re-ingesting entire datasets on each execution, incremental strategies retrieve only the records that are new or modified since the last load. This approach reduces latency, improves efficiency, and lowers infrastructure costs. When designing incremental loads, a common dilemma arises: should you track changes with a resource_version field or with created_at/updated_at timestamps? …
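The timestamp-based variant can be sketched in a few lines: keep a watermark from the last successful run and fetch only rows modified after it. The records and the `updated_at` column name are illustrative assumptions.

```python
from datetime import datetime

# Illustrative source rows; in practice these come from a database or API.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]


def incremental_load(rows, watermark):
    """Return only the rows modified after the last successful load."""
    return [r for r in rows if r["updated_at"] > watermark]


last_load = datetime(2024, 1, 3)  # watermark persisted by the previous run
new_rows = incremental_load(rows, last_load)
print([r["id"] for r in new_rows])  # [2, 3]
```

A resource_version counter works the same way, except the watermark is a monotonically increasing integer rather than a timestamp, which sidesteps clock-skew issues.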

Running Apache Airflow Across Environments

Apache Airflow has become a de facto standard for orchestrating data workflows. However, the way Airflow runs can change significantly depending on the environment, and many teams get confused when moving between managed cloud services, local setups, and containerized deployments. This post provides a clear comparison of how Airflow operates in each of these contexts. …
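As a quick orientation for two of the contexts compared, here are the typical entry points; the Docker line assumes a compose file based on the official Airflow template.

```shell
# Local development: everything (webserver, scheduler, SQLite DB)
# in a single process, intended for experimentation only.
pip install apache-airflow
airflow standalone

# Containerized: each component runs in its own container,
# driven by a docker-compose.yaml from the official Airflow docs.
docker compose up
```

Managed services such as MWAA or Cloud Composer remove these steps entirely, which is precisely why behavior differs across environments.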

Can You Perform Data Grouping Directly with the yFinance API?

When working with financial data, efficient aggregation and analysis are essential for generating meaningful insights. A common question among developers and data analysts is whether the yFinance Python library, a popular tool for retrieving historical stock market data, allows grouping or aggregating data directly through its API. The short answer is no: yFinance does not. …
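Since yFinance returns pandas DataFrames, the usual approach is to aggregate client-side with pandas. The sketch below uses synthetic daily closes in place of a `yf.download(...)` result so it runs offline; the column name `Close` matches yFinance's convention.

```python
import pandas as pd

# Synthetic daily closing prices standing in for a yFinance download.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
prices = pd.DataFrame(
    {"Close": [100, 101, 102, 101, 103, 104, 105, 104, 106, 107]},
    index=idx,
)

# Grouping happens in pandas, not in the yFinance API:
# resample the daily series into weekly averages.
weekly = prices["Close"].resample("W").mean()
print(weekly)
```

The same `resample`/`groupby` machinery covers monthly OHLC, rolling windows, or per-ticker aggregation once the raw data is retrieved.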

Optimizing Partition Strategies in Apache Iceberg on AWS

When working with large-scale analytical datasets, efficient partitioning is critical for achieving optimal query performance and cost savings. Apache Iceberg, a modern table format designed for big data, offers powerful partitioning capabilities. One common design decision is whether to use a single date column (e.g., yyyymmdd) or separate columns for year, month, and day. …
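In Spark SQL DDL, the two designs look like this; table and column names are illustrative. Option A uses Iceberg's hidden `days()` partition transform, so queries filter on the timestamp directly, while Option B requires filtering each column explicitly.

```sql
-- Option A: single timestamp column with a hidden day transform.
CREATE TABLE analytics.events (
    event_id BIGINT,
    event_ts TIMESTAMP,
    payload  STRING
)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Option B: explicit year/month/day partition columns.
CREATE TABLE analytics.events_ymd (
    event_id BIGINT,
    year INT,
    month INT,
    day INT,
    payload STRING
)
USING iceberg
PARTITIONED BY (year, month, day);
```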

How Transactions Work in Databricks Using Delta Lake

Databricks is a powerful platform for big data analytics and machine learning. One of its key features is the ability to run transactional workloads over large-scale data lakes using Delta Lake. This post explores how transactions are supported in Databricks and how you can use them to ensure data consistency and integrity. …
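A short SQL sketch of what Delta Lake's ACID guarantees mean in practice; the `sales` table is illustrative, and the statements assume a Databricks or Delta-enabled Spark environment.

```sql
CREATE TABLE sales (id BIGINT, amount DOUBLE) USING DELTA;

-- Atomic append: concurrent readers see all of these rows or none.
INSERT INTO sales VALUES (1, 10.0), (2, 20.0);

-- MERGE runs as a single ACID transaction: matched rows are updated
-- and unmatched rows inserted, with no partially applied state visible.
MERGE INTO sales AS t
USING (SELECT 2 AS id, 25.0 AS amount) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);
```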