Geek Logbook

Tech sea log book

When Should You Use Parquet and When Should You Use Iceberg?

In modern data architectures, selecting the right storage and management solution is essential for building efficient, reliable, and scalable pipelines. Two popular choices that often come up are Parquet and Apache Iceberg. While they can work together, they serve different purposes and solve different problems. This article explains what each one is and when to use each.

Summary: Teaching HDFS Concepts to New Learners

Introducing the Hadoop Distributed File System (HDFS) to newcomers can be both exciting and challenging. To make the learning experience structured and impactful, it’s helpful to break down the core topics into digestible parts. This blog post summarizes a beginner-friendly teaching sequence based on real questions and progressive discovery, covering the key topics, practical teaching tips, and a brief conclusion.

Is S3 the New HDFS? Comparisons and Use Cases in Big Data

Over the past decade, the way organizations store and manage big data has shifted dramatically. Once dominated by the Hadoop Distributed File System (HDFS), the field is now led by Amazon S3 and similar cloud object storage systems. This raises a compelling question in today’s data engineering world: is Amazon S3 the new HDFS? Let’s compare the two and look at their use cases.

Optimizing Joins in PostgreSQL: Practical Cases

Joins are essential for querying relational databases, but they can significantly impact performance if not optimized correctly. PostgreSQL provides several ways to improve join efficiency, from indexing strategies to query restructuring. In this post, we’ll explore different types of joins, performance considerations, and practical ways to optimize them, starting with the join types PostgreSQL supports.
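The indexing idea this excerpt mentions can be sketched in a few lines. This is a minimal illustration only: it uses Python's built-in sqlite3 as a lightweight stand-in for PostgreSQL (where the equivalents are CREATE INDEX and EXPLAIN), and the orders table and idx_orders_customer index are invented for the example.

```python
import sqlite3

# Stand-in demo (SQLite, not PostgreSQL): an index on the join/filter key
# lets the planner replace a full table scan with an index search.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, i % 1000) for i in range(5000)])

query = "SELECT * FROM orders WHERE customer_id = 42"

# Before indexing: the plan is a full scan of orders.
plan_before = str(cur.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Index the key used by the join/filter predicate.
cur.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
plan_after = str(cur.execute("EXPLAIN QUERY PLAN " + query).fetchall())

print(plan_before)  # the plan mentions a SCAN of the table
print(plan_after)   # the plan now searches via idx_orders_customer
```

In PostgreSQL the same check is done with EXPLAIN (or EXPLAIN ANALYZE) before and after creating the index on the join key.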

Benchmarking OLTP vs. OLAP: Measuring Performance Effectively

Understanding the performance differences between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) is crucial for designing efficient database systems. This post outlines a structured approach to benchmarking these two architectures and measuring their efficiency based on real-world scenarios. To compare OLTP and OLAP performance, we focus on a set of key metrics.
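As a rough sketch of the two workload shapes being benchmarked, the snippet below times many small committed writes (an OLTP-style metric: transactions per second) against one full-table aggregate (an OLAP-style metric: query latency). It uses Python's built-in sqlite3 purely for illustration; the sales table is invented for the example.

```python
import sqlite3
import time

# OLTP: many small transactions; metric = transactions per second.
# OLAP: one large analytical scan; metric = query latency.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

start = time.perf_counter()
for i in range(1000):                  # OLTP-style: 1000 single-row commits
    cur.execute("INSERT INTO sales (amount) VALUES (?)", (i * 0.5,))
    conn.commit()
oltp_tps = 1000 / (time.perf_counter() - start)

start = time.perf_counter()            # OLAP-style: one full-table aggregate
total, avg = cur.execute("SELECT SUM(amount), AVG(amount) FROM sales").fetchone()
olap_latency = time.perf_counter() - start

print(f"OLTP throughput: {oltp_tps:,.0f} tx/s")
print(f"OLAP latency: {olap_latency * 1000:.2f} ms (sum={total}, avg={avg})")
```

A real benchmark would of course use realistic data volumes, concurrency, and a server-class engine, but the two metrics measured here are the ones the comparison hinges on.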

Comparison Between Star Schema and Snowflake Schema in PostgreSQL

When designing a database for analytical workloads, choosing the right schema can significantly impact performance and query efficiency. The two most common data warehouse schema models are the Star Schema and the Snowflake Schema. In this post, we’ll explore the differences between these schemas and their advantages and disadvantages.
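The structural difference between the two models can be sketched quickly. The snippet below is illustrative only: it uses Python's built-in sqlite3 rather than PostgreSQL, and all table names (fact_sales, dim_product, dim_category, and so on) are invented for the example. The point is that the snowflake layout answers the same question with one extra join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: the fact table joins directly to one wide, denormalized dimension.
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          product_name TEXT, category_name TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY,
                         product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 20.0);
""")
star = cur.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category_name
""").fetchall()

# Snowflake schema: the category attribute is normalized into its own table,
# so the same question needs one extra join.
cur.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_sf (product_id INTEGER PRIMARY KEY,
                             product_name TEXT, category_id INTEGER);
INSERT INTO dim_category VALUES (10, 'Hardware');
INSERT INTO dim_product_sf VALUES (1, 'Widget', 10);
""")
snowflake = cur.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_sf p USING (product_id)
    JOIN dim_category c USING (category_id)
    GROUP BY c.category_name
""").fetchall()

print(star)       # both layouts yield the same aggregate
print(snowflake)
```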

Running PySpark on Google Colab: Do You Still Need findspark?

For a long time, using Apache Spark in Google Colab required manual setup, including installing Spark and configuring Python to recognize it. This was often done using the findspark library. However, recent changes in Colab have made this process much simpler. In this post, we will explore whether findspark is still necessary and how the setup has changed.

Generating a Calendar Table in Power Query (M Language)

When working with Power BI or other Power Query-supported tools, having a well-structured calendar table is essential for time-based analysis. In this blog post, we will walk through an M Language function that generates a comprehensive calendar table. Why use a calendar table? It provides essential time-based fields such as year, quarter, and month.
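For readers who don't have Power Query at hand, the same idea can be sketched in plain Python: one row per day, with derived time fields similar to those the post's M function produces (the field names here are illustrative, not the M function's actual column names).

```python
from datetime import date, timedelta

def calendar_table(start: date, end: date) -> list[dict]:
    """Build one row per day between start and end (inclusive),
    with common derived time fields."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "Date": d,
            "Year": d.year,
            "Quarter": (d.month - 1) // 3 + 1,
            "Month": d.month,
            "Day": d.day,
            "ISOWeek": d.isocalendar()[1],
        })
        d += timedelta(days=1)
    return rows

table = calendar_table(date(2024, 1, 1), date(2024, 12, 31))
print(len(table))   # 366 rows: 2024 is a leap year
print(table[0])
```

The M version does the equivalent with List.Dates and Date.Year, Date.QuarterOfYear, Date.Month, and related functions over the generated date list.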

Handling Schema Changes in a Data Warehouse

When building and maintaining a Data Warehouse (DWH), handling schema changes without breaking existing processes is a crucial challenge for data engineers. As new requirements emerge, we often need to add new fields, modify existing structures, or adjust data models while ensuring smooth operation for reporting and analytics. This blog post explores best practices for handling these changes.

Delta Lake vs. Traditional Data Lakes: Key Differences and Vendor Options

As data-driven organizations scale their analytics and machine learning workloads, the limitations of traditional data lakes become more apparent. Delta Lake is an open-source storage layer that enhances data lakes with ACID transactions, schema enforcement, and time travel, making them more reliable for big data workloads. In this post, we will explore how Delta Lake compares to traditional data lakes and which vendors support it.