Geek Logbook

Tech sea log book

Can You Perform Data Grouping Directly with the yFinance API?

When working with financial data, efficient aggregation and analysis are essential for generating meaningful insights. A common question among developers and data analysts is whether the yFinance Python library, a popular tool for retrieving historical stock market data, allows grouping or aggregating data directly through its API. The short answer is no: yFinance does not offer built-in grouping or aggregation; it returns pandas DataFrames, so any aggregation happens client-side.
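As a sketch of the client-side approach: yfinance returns ordinary pandas DataFrames, so grouping is done with pandas itself. The frame below is synthetic so the snippet runs offline; in practice it would come from a call such as `yf.download("AAPL", start=..., end=...)`.

```python
import numpy as np
import pandas as pd

# In practice this frame would come from yfinance, e.g.:
#   import yfinance as yf
#   df = yf.download("AAPL", start="2024-01-01", end="2024-06-30")
# Here we build a synthetic frame with the same shape (DatetimeIndex,
# price/volume columns) so the sketch runs without network access.
idx = pd.date_range("2024-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "Close": 100 + rng.standard_normal(120).cumsum(),
        "Volume": rng.integers(1_000, 10_000, size=120),
    },
    index=idx,
)

# The grouping happens in pandas, not in the yFinance API:
# average closing price and total volume per calendar month.
monthly = df.groupby(df.index.to_period("M")).agg(
    {"Close": "mean", "Volume": "sum"}
)
print(monthly)
```

Grouping by `df.index.to_period("M")` rather than `resample("M")` sidesteps the frequency-alias renames across pandas versions.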

Optimizing Partition Strategies in Apache Iceberg on AWS

When working with large-scale analytical datasets, efficient partitioning is critical for achieving optimal query performance and cost savings. Apache Iceberg, a modern table format designed for big data, offers powerful partitioning capabilities. One common design decision is whether to partition on a single date column (e.g., yyyymmdd) or on separate year, month, and day columns.
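For illustration, the two designs can be compared in Spark SQL DDL (table and column names here are made up for the sketch). Iceberg's hidden partitioning lets a transform such as `days(event_ts)` derive the partition value from the timestamp column itself, so no extra columns are needed:

```sql
-- Option A: one timestamp column with a hidden partition transform.
-- Queries that filter on event_ts prune partitions automatically.
CREATE TABLE analytics.events (
    event_id BIGINT,
    event_ts TIMESTAMP,
    payload  STRING
) USING iceberg
PARTITIONED BY (days(event_ts));

-- Option B: explicit year/month/day columns, partitioned directly.
-- Pruning only happens when queries filter on these exact columns.
CREATE TABLE analytics.events_ymd (
    event_id BIGINT,
    year     INT,
    month    INT,
    day      INT,
    payload  STRING
) USING iceberg
PARTITIONED BY (year, month, day);
```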

How Transactions Work in Databricks Using Delta Lake

Databricks is a powerful platform for big data analytics and machine learning. One of its key features is the ability to run transactional workloads over large-scale data lakes using Delta Lake. This post explores how transactions are supported in Databricks and how you can use them to ensure data consistency and integrity.
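To make the idea concrete, here is a deliberately simplified Python model of the mechanism behind Delta Lake's transaction log. It is a sketch of the concept (an append-only `_delta_log/` of versioned JSON commit files with create-if-absent semantics), not Delta's actual implementation:

```python
import json
import os
import tempfile

# Simplified model of Delta Lake's transaction log: each commit is a
# zero-padded, versioned JSON file under _delta_log/, and a commit only
# succeeds if its version file does not already exist.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

def commit(version: int, actions: list) -> None:
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Open mode "x" fails if the file exists: two writers racing for the
    # same version cannot both win, so the loser must retry at version + 1.
    # This exclusive-create step is what makes the commit atomic.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

commit(0, [{"add": {"path": "part-0000.parquet"}}])
commit(1, [{"add": {"path": "part-0001.parquet"}}])

# Readers reconstruct the table state by replaying the log in version order.
versions = sorted(os.listdir(log_dir))
print(versions)
```

A retrying writer that bumps the version and re-validates its changes is essentially optimistic concurrency control, which is how Delta resolves conflicting writes.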

Versioning Terraform Resources to Meet CIS Security Standards

Infrastructure as Code (IaC) has become a foundational practice for modern DevOps and cloud-native teams. Terraform, as one of the most widely adopted IaC tools, enables infrastructure automation, consistency, and repeatability. However, when working in regulated environments or organizations with strict compliance requirements, it’s not enough to just automate: you must also govern and secure your infrastructure.
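As one example of what governing can mean in Terraform, pinning tool, provider, and module versions makes builds reproducible and auditable; the module source and version numbers below are illustrative:

```hcl
terraform {
  required_version = "~> 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # pinned range; upgrades are reviewed, not implicit
    }
  }
}

module "logging_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "4.1.2" # exact module pin for repeatable, auditable builds
}
```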

Handling Python datetime Objects in Amazon DynamoDB

When developing data pipelines or applications that store time-based records in Amazon DynamoDB, developers frequently encounter serialization errors when working with Python’s datetime objects. Understanding how to properly store temporal data in DynamoDB is essential to avoid runtime issues and to enable meaningful queries. DynamoDB, as a NoSQL database, supports only a limited set of native data types, and datetime is not among them.
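A minimal sketch of the usual workaround, assuming a hypothetical item layout: convert datetime values to an ISO 8601 string or to epoch seconds before handing the item to boto3 (e.g., `table.put_item(Item=item)`):

```python
from datetime import datetime, timezone
from decimal import Decimal

# DynamoDB has no native datetime type, so boto3 raises a serialization
# error for raw datetime objects. Two common encodings:
dt = datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc)

item = {
    "pk": "sensor#42",  # hypothetical key for illustration
    # ISO 8601 string: human-readable, and UTC timestamps sort
    # lexicographically in chronological order, so the attribute works
    # in range-key conditions (BETWEEN, begins_with).
    "created_at": dt.isoformat(),
    # Epoch seconds as a Number (boto3 requires Decimal for numbers);
    # this encoding is required if the attribute is used as a TTL.
    "expires_at": Decimal(int(dt.timestamp())),
}
print(item["created_at"])  # 2024-05-01T12:30:00+00:00
```

Choosing between the two usually comes down to whether you need readable timestamps and string range queries (ISO) or compactness and TTL support (epoch).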

Choosing Between DynamoDB and Cassandra for a Crypto Exchange

When designing the backend of a crypto exchange, selecting the right database architecture is crucial. Two common NoSQL databases often considered for this type of application are Amazon DynamoDB and Apache Cassandra. Both offer horizontal scalability and high availability, but they shine in different use cases. This post explores their differences with concrete examples.

AWS Glue Workflow vs Apache Airflow: A Professional Comparison

While both serve the common purpose of managing and automating data workflows, they differ significantly in architecture, flexibility, integration capabilities, and operational control. This article offers a comprehensive and professional comparison of AWS Glue Workflow and Apache Airflow to help data engineers, architects, and decision-makers choose the most suitable tool for their use case.

Reducing AWS Costs: How to Temporarily Stop an Aurora Serverless v2 Cluster

When managing cloud infrastructure, minimizing costs without compromising data integrity is a continuous priority. Amazon Aurora Serverless v2 offers scalability and high availability, but unlike traditional RDS instances, it introduces nuances in how compute resources are billed. One common question arises: can an Aurora Serverless v2 database be stopped to save costs?
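Assuming the cluster configuration supports stop/start, one common approach is the AWS CLI (the cluster identifier below is illustrative). Note that storage is still billed while a cluster is stopped, and AWS automatically restarts a stopped cluster after seven days:

```bash
# Stop the cluster: compute billing pauses, storage billing continues.
aws rds stop-db-cluster --db-cluster-identifier my-aurora-sv2-cluster

# Resume the cluster when it is needed again.
aws rds start-db-cluster --db-cluster-identifier my-aurora-sv2-cluster
```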

The Enduring Relevance of Peter Chen’s Entity-Relationship Model

In the landscape of data modeling, few contributions have had the long-lasting impact of Peter Chen’s Entity-Relationship (E-R) Model, introduced in 1976. More than four decades later, it remains a foundational framework for conceptualizing and designing data systems, bridging the gap between abstract business understanding and concrete database implementation.

How Hadoop Made Specialized Storage Hardware Obsolete

In the early 2000s, enterprise data processing was dominated by high-end hardware. Organizations relied heavily on centralized storage systems such as SAN (Storage Area Networks) and NAS (Network Attached Storage), typically connected to symmetric multiprocessing (SMP) servers or high-performance computing (HPC) clusters. These environments were expensive to scale, difficult to manage, and designed to avoid hardware failure through costly redundancy.