Geek Logbook

Tech sea log book

Ranking Products Using Window Functions in PySpark

Window functions are powerful tools in SQL and PySpark that allow us to perform calculations across a subset of rows related to the current row. In this blog post, we’ll explore how to use window functions in PySpark to rank products based on their sales and filter those with sales above the category average.
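As a minimal sketch of the technique the post covers, assuming a hypothetical DataFrame with product, category, and sales columns:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("product-ranking").getOrCreate()

# Hypothetical sample data: (product, category, sales)
df = spark.createDataFrame(
    [("A", "Books", 100), ("B", "Books", 300), ("C", "Toys", 250), ("D", "Toys", 150)],
    ["product", "category", "sales"],
)

# Rank products by sales within each category
rank_win = Window.partitionBy("category").orderBy(F.desc("sales"))
ranked = df.withColumn("rank", F.rank().over(rank_win))

# Compute the category-average sales and keep only products above it
avg_win = Window.partitionBy("category")
result = (ranked
          .withColumn("avg_sales", F.avg("sales").over(avg_win))
          .filter(F.col("sales") > F.col("avg_sales")))

result.show()
```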

Handling Null Values in Data: Algorithms and Strategies

Null values are a common challenge in data analysis and machine learning. Dealing with them effectively is essential to ensure the reliability of your insights and models. In this post, we’ll explore various strategies and algorithms for handling null values, ranging from simple techniques, such as removing null rows, to more advanced methods.
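A short sketch of the two simplest strategies the post mentions, removal and imputation, using pandas on a hypothetical dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 58000],
})

# 1. Removing null values: drop any row that contains a null
dropped = df.dropna()

# 2. Simple imputation: fill nulls with a column statistic
imputed = df.fillna({
    "age": df["age"].median(),
    "income": df["income"].mean(),
})

print(dropped)
print(imputed)
```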

What Does an Exploratory Data Analysis (EDA) Evaluate?

An Exploratory Data Analysis (EDA) is a critical step in the data analysis process that focuses on evaluating and examining data to uncover its main characteristics. It is performed before delving deeper into analysis or building predictive models. The primary purpose of an EDA is to understand the dataset, identify issues, and gain insights.
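A few standard pandas calls cover the core of what an EDA evaluates: structure, summary statistics, missing values, and duplicates. A minimal sketch, assuming a hypothetical CSV file:

```python
import pandas as pd

# Hypothetical dataset loaded from CSV
df = pd.read_csv("data.csv")

df.info()                      # column types and non-null counts
print(df.describe())           # summary statistics for numeric columns
print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # number of duplicate rows
```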

Exploring Free Resources to Learn AWS and Azure Cloud Platforms

Cloud computing is an essential skill in today’s tech landscape. Among the major players, AWS and Azure stand out as leading cloud platforms, offering a wealth of free resources to help individuals learn and experiment. This blog post outlines some of the most valuable free tools, learning paths, and tips for getting started with AWS and Azure.

Adding Custom Columns to Your Date Table in Power BI

A Date Table is an integral part of building robust and insightful Power BI reports. While a basic Date Table allows for time-based filtering and analysis, custom columns can add even more depth and flexibility. This blog post will guide you through adding custom columns to your Date Table using DAX.
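The post itself works in DAX; as a rough analogue in Python/pandas, the same kinds of custom columns (year, month name, quarter, weekend flag) can be derived from a date range. A sketch under those assumptions, not the post's actual DAX:

```python
import pandas as pd

# Build a basic date table, then add custom columns analogous to the DAX ones
dates = pd.DataFrame({"Date": pd.date_range("2024-01-01", "2024-12-31", freq="D")})

dates["Year"] = dates["Date"].dt.year
dates["MonthNumber"] = dates["Date"].dt.month
dates["MonthName"] = dates["Date"].dt.strftime("%B")
dates["Quarter"] = "Q" + dates["Date"].dt.quarter.astype(str)
dates["IsWeekend"] = dates["Date"].dt.dayofweek >= 5

print(dates.head())
```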

Grouping Data in PySpark with Aliases for Aggregated Columns

When working with large datasets in PySpark, grouping data and applying aggregations is a common task. In this post, we’ll explore how to group data by a specific column and use aliases for the resulting aggregated columns to improve readability and clarity. The post works through a sample purchases dataset with the columns IdCompra, Fecha, IdProducto, Cantidad, Precio, and IdProveedor, as sketched below.
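A minimal sketch of the grouping-with-aliases pattern, using hypothetical rows that mirror the post's column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("grouping-aliases").getOrCreate()

# Hypothetical purchases data mirroring the columns in the post
compras = spark.createDataFrame(
    [(1, "2024-01-05", 10, 3, 19.99, 7),
     (2, "2024-01-06", 10, 1, 19.99, 7),
     (3, "2024-01-06", 11, 5, 4.50, 8)],
    ["IdCompra", "Fecha", "IdProducto", "Cantidad", "Precio", "IdProveedor"],
)

# Group by product and alias each aggregated column for readability
resumen = compras.groupBy("IdProducto").agg(
    F.sum("Cantidad").alias("total_cantidad"),
    F.avg("Precio").alias("precio_promedio"),
    F.count("IdCompra").alias("num_compras"),
)

resumen.show()
```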

Handling Offset-Naive and Offset-Aware Datetimes in Python

When working with datetime objects in Python, you may encounter the error "TypeError: can't compare offset-naive and offset-aware datetimes". This error occurs when comparing two datetime objects where one contains timezone information (offset-aware) and the other does not (offset-naive). To resolve it, you must make both datetime objects either offset-aware or offset-naive before comparing them.
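A minimal sketch of both fixes, assuming the naive datetime represents UTC:

```python
from datetime import datetime, timezone

naive = datetime(2024, 6, 1, 12, 0)                       # offset-naive
aware = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)  # offset-aware

# Ordering comparisons such as `naive < aware` raise:
# TypeError: can't compare offset-naive and offset-aware datetimes

# Option 1: make the naive datetime offset-aware (assuming it represents UTC)
naive_as_aware = naive.replace(tzinfo=timezone.utc)
print(naive_as_aware < aware)   # False -- both aware, comparison now works

# Option 2: make the aware datetime offset-naive by dropping its timezone
aware_as_naive = aware.replace(tzinfo=None)
print(naive < aware_as_naive)   # False -- both naive, comparison now works
```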

Automating SQL Script Execution with Cron

In this blog post, we’ll explore how to automate the execution of SQL scripts using cron, a powerful scheduling tool available on Unix-based systems. This approach is ideal for database administrators and developers who need to run SQL scripts at specific intervals without manual intervention. Cron jobs allow you to schedule tasks to run automatically at defined times or intervals, as sketched below.
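One way to set this up, assuming PostgreSQL and psql; the script path, database names, and the run_sql.py wrapper name are all hypothetical:

```python
import subprocess

# Hypothetical path to the SQL script to execute
SQL_SCRIPT = "/home/user/scripts/nightly_report.sql"

# Run the script with psql; assumes credentials come from a .pgpass file
# or the PGPASSWORD environment variable
subprocess.run(
    ["psql", "-h", "localhost", "-U", "app_user", "-d", "app_db", "-f", SQL_SCRIPT],
    check=True,
)

# Example crontab entry to run this wrapper every day at 2 AM:
# 0 2 * * * /usr/bin/python3 /home/user/scripts/run_sql.py >> /var/log/sql_cron.log 2>&1
```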

Counting Word Frequency in a SQL Column

Sometimes, you may need to analyze text data stored in a database, such as counting the frequency of words in a text column. This blog post demonstrates how to achieve this in SQL using a practical example: a table named feedback with a column comentarios that contains free-text data.
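A sketch of one approach, assuming a PostgreSQL database (the teaser doesn't name the SQL dialect) and hypothetical connection details; the query splits each comment into words with regexp_split_to_table and counts them:

```python
import psycopg2

# Assumes a PostgreSQL database containing the feedback(comentarios) table
conn = psycopg2.connect("dbname=app_db user=app_user host=localhost")

query = """
    SELECT word, COUNT(*) AS frequency
    FROM feedback,
         regexp_split_to_table(lower(comentarios), '\\s+') AS word
    WHERE word <> ''
    GROUP BY word
    ORDER BY frequency DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for word, frequency in cur.fetchall():
        print(word, frequency)
```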