Geek Logbook

Tech sea log book

When Should You Use Parquet and When Should You Use Iceberg?

In modern data architectures, selecting the right storage and management solution is essential for building efficient, reliable, and scalable pipelines. Two popular choices that often come up are Parquet and Apache Iceberg. While they can work together, they serve different purposes and solve different problems. This article explains what each one is, when to use…

How to Fix ‘DataFrame’ object has no attribute ‘writeTo’ When Working with Apache Iceberg in PySpark

If you’re working with Apache Iceberg in PySpark and encounter the error ‘DataFrame’ object has no attribute ‘writeTo’, you’re not alone. This is a common mistake when transitioning from the traditional DataFrame.write syntax to Iceberg’s DataFrameWriterV2 API. Let’s walk through why this happens, how to fix it quickly, and when to use each writing method…
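As a quick orientation before the full post: `DataFrame.writeTo` (the DataFrameWriterV2 API) only exists in Spark 3.0 and later, so on older clusters the attribute is simply missing. A minimal sketch, assuming a helper that checks the Spark version string; the table name `my_catalog.db.events` is a hypothetical example:

```python
def supports_writer_v2(spark_version: str) -> bool:
    """Return True if this Spark version has DataFrame.writeTo (Spark 3.0+)."""
    major = int(spark_version.split(".")[0])
    return major >= 3

# Usage inside an active Spark session (not executed here):
# if supports_writer_v2(spark.version):
#     df.writeTo("my_catalog.db.events").append()      # DataFrameWriterV2 path
# else:
#     # Older V1 syntax still works with the Iceberg connector
#     df.write.format("iceberg").mode("append").save("my_catalog.db.events")
```

The post goes into when each writing style is appropriate; the sketch above just shows the version gate that explains the error.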

What Is Sharding and Why It Matters

As our world becomes increasingly digital, the amount of data we create every day is staggering. Think about all the emails, messages, orders, and photos uploaded every second. How do big companies manage and store so much information efficiently? One of the key techniques they use is called sharding. What Is Sharding? Sharding is a…
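The core mechanic the post describes can be sketched in a few lines: hash each record’s key and use the result to pick which shard (database node) stores it. A minimal illustration, assuming simple hash-based sharding with a fixed shard count; the order IDs are made-up examples:

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Deterministically map a record key to one of n_shards.

    MD5 is used here only for stable, platform-independent hashing,
    not for security.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# Route a few orders across 4 shards
orders = ["order-1001", "order-1002", "order-1003"]
placement = {order: shard_for(order, 4) for order in orders}
```

Note that with plain modulo sharding, changing `n_shards` remaps most keys; real systems often use consistent hashing or directory-based schemes to avoid that.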

From Tables to Partitions: Designing NoSQL Databases with Cassandra

As data professionals transition from relational databases to NoSQL systems like Apache Cassandra, one of the most important mindset shifts is understanding that you don’t model data for storage, but for queries. This departure from the familiar world of third normal form (3NF) requires not only technical adjustments but also a new way of thinking…

Apache Cassandra vs Apache Parquet: Understanding the Differences

In modern data architectures, it’s common to encounter both Apache Cassandra and Apache Parquet, particularly when dealing with large-scale, distributed systems. Both technologies are associated with columnar data models, which often leads to confusion. However, Cassandra and Parquet serve fundamentally different purposes and operate at different layers of the data stack. This article clarifies their…

Import Live Crypto Prices into Google Sheets

Are you tired of checking crypto prices manually? Want to automate your portfolio tracking or build a custom crypto dashboard? Good news — with just a few steps, you can pull live cryptocurrency prices directly into Google Sheets. In this guide, we’ll show you three simple methods to get real-time crypto data, whether you’re a…

How Dynamo Reshaped the Internal Architecture of Amazon S3

Amazon S3 launched in 2006 as a scalable, durable object storage system. It avoided hierarchical file systems and used flat key-based addressing from day one. However, early versions of S3 ran into architectural challenges—especially in metadata consistency and fault tolerance. Meanwhile, another internal team at Amazon was building Dynamo, a distributed key-value store optimized for…

What’s Behind Amazon S3?

When you upload a file to the cloud using an app or service, there’s a good chance it’s being stored on Amazon S3 (Simple Storage Service). But what powers it under the hood? What is Amazon S3? Amazon S3 is an object storage service that allows users to store and retrieve any amount of data…

Summary: Teaching HDFS Concepts to New Learners

Introducing Hadoop Distributed File System (HDFS) to newcomers can be both exciting and challenging. To make the learning experience structured and impactful, it’s helpful to break down the core topics into digestible parts. This blog post summarizes a beginner-friendly teaching sequence based on real questions and progressive discovery…