Geek Logbook

Tech sea log book

Managing Evolving Schemas in Apache Spark: A Strategic Approach

Schema management is one of the most overlooked yet critical aspects of building reliable data pipelines. In a fast-moving environment, schemas rarely remain static: new fields are added, data types evolve, and nested structures become more complex. Relying on hard-coded schemas within Spark jobs may seem convenient at first, but it quickly turns into a

Secure Ways to Share Private Data on AWS: Beyond Public Buckets

When building data platforms in the cloud, it is common to share data with partners, clients, or internal teams outside your own. AWS provides several mechanisms to grant secure, granular access — far beyond the simple (and risky) “make the bucket public” approach. In this post, we will explore the main strategies for sharing data

Fixing Cursor Login Issues on Linux (AppImage)

When running Cursor on Linux, especially with the AppImage version, you might encounter a situation where you can’t log in. This usually happens because Cursor stores its session state locally, and sometimes that state gets corrupted. In this article, we’ll walk through how to diagnose the issue and reset your session state without losing your

Querying JSONB in PostgreSQL Efficiently

In modern applications, it is common to store semi-structured data in JSON format inside a relational database like PostgreSQL. However, to analyze this data properly, you need a way to transform it into a tabular structure that can be queried with standard SQL. In this article, we will demonstrate a real-world example of reading a

Designing a Semantic Layer for Athena + Power BI

Modern data architectures benefit from a clear separation of layers: Ingestion, Staging, and Semantic (Presentation). When using Amazon Athena as the query engine and Power BI as the visualization tool, this layered approach enables scalability, governance, and cost control.

1. Ingestion (Raw Layer)

Purpose: Store data exactly as it arrives from source systems, preserving fidelity.

Understanding Window Functions in SQL: Beyond Simple Aggregations

When we think about SQL functions, we often start with scalar functions (UPPER(), ROUND(), NOW()) or aggregate functions (SUM(), AVG(), COUNT()). But there is a third type that is essential for advanced analytics: window functions.

The “Window”: The Metaphor Behind the Concept

A window function is evaluated for every row, but not in isolation

How to Set CloudWatch Log Retention Policies with Terraform

Amazon CloudWatch is a powerful service for monitoring applications and infrastructure. However, by default, log groups in CloudWatch Logs never expire their data. This can lead to excessive storage costs and retention of data that you may not need. A better approach is to define a retention policy that aligns with your operational and compliance requirements. In

Orchestrating Multiple AWS Glue Workflows with Step Functions

In modern data architectures, it is common to manage multiple ETL pipelines that must run in sequence or in parallel. AWS Glue provides a robust framework for building workflows, but when we need to orchestrate two or more Glue Workflows together, AWS Step Functions becomes the natural choice. In this post, we will explain how

Understanding the Strategy Design Pattern

In the landscape of software design, maintaining flexibility and scalability is crucial. One of the most effective ways to achieve these qualities is by leveraging design patterns. Among the behavioral design patterns, the Strategy Pattern stands out as a powerful tool to manage algorithms dynamically.

What is the Strategy Pattern?

The Strategy Pattern allows you