Geek Logbook

Tech sea log book

How Dynamo Reshaped the Internal Architecture of Amazon S3

Introduction: Amazon S3 launched in 2006 as a scalable, durable object storage system. It avoided hierarchical file systems and used flat key-based addressing from day one. However, early versions of S3 ran into architectural challenges—especially in metadata consistency and fault tolerance. Meanwhile, another internal team at Amazon was building Dynamo, a distributed key-value store optimized for

What’s Behind Amazon S3?

When you upload a file to the cloud using an app or service, there’s a good chance it’s being stored on Amazon S3 (Simple Storage Service). But what powers it under the hood? What is Amazon S3? Amazon S3 is an object storage service that allows users to store and retrieve any amount of data,
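As a rough sketch of that flat, key-based model, here is how an object round-trip might look with the boto3 SDK; the bucket and key names are hypothetical, and credentials are assumed to be configured in the environment:

```python
# Minimal sketch: storing and retrieving an object in S3 with boto3.
# Bucket and key names are hypothetical; AWS credentials are assumed to be
# available in the environment (e.g. via AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
import boto3

s3 = boto3.client("s3")

# Upload: every object is addressed by a flat (bucket, key) pair, not a directory tree.
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/2024/summary.txt",
    Body=b"hello from S3",
)

# Download: retrieve the same bytes back by key.
response = s3.get_object(Bucket="my-example-bucket", Key="reports/2024/summary.txt")
data = response["Body"].read()
print(data)  # b'hello from S3'
```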

Summary: Teaching HDFS Concepts to New Learners

Introducing Hadoop Distributed File System (HDFS) to newcomers can be both exciting and challenging. To make the learning experience structured and impactful, it’s helpful to break down the core topics into digestible parts. This blog post summarizes a beginner-friendly teaching sequence based on real questions and progressive discovery, covering the key topics, teaching tips, and a short conclusion.

How HDFS Achieves Fault Tolerance Through Replication

One of the core strengths of the Hadoop Distributed File System (HDFS) is its fault tolerance. In a world of distributed computing, failures are not rare—they’re expected. HDFS tackles this by using block-level replication to ensure that data is never lost, even when individual nodes fail. What Is Replication in HDFS? When a file is
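To make the idea concrete, here is a small conceptual simulation of replica placement and re-replication after a node failure. This is not the actual HDFS code; the node names and the replication factor of 3 are illustrative assumptions:

```python
# Conceptual sketch of block-level replication: a block survives a DataNode
# failure because copies exist elsewhere, and the system re-replicates until
# the target replication factor is restored.
import random

REPLICATION_FACTOR = 3
datanodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

# Block ID -> set of DataNodes currently holding a replica.
block_locations = {
    "blk_001": set(random.sample(sorted(datanodes), REPLICATION_FACTOR)),
}

def handle_datanode_failure(failed_node: str) -> None:
    """Drop the failed node's replicas and re-replicate under-replicated blocks."""
    live_nodes = datanodes - {failed_node}
    for block, replicas in block_locations.items():
        replicas.discard(failed_node)
        # Copy the block from any surviving replica onto new nodes until the
        # target replication factor is restored.
        while len(replicas) < REPLICATION_FACTOR and live_nodes - replicas:
            replicas.add(min(live_nodes - replicas))

failed = next(iter(block_locations["blk_001"]))
print("losing", failed)
handle_datanode_failure(failed)
print(block_locations["blk_001"])  # still 3 replicas, none on the failed node
```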

How Spark and MapReduce Handle Partial Records in HDFS

When working with large-scale data processing frameworks like Apache Spark or Hadoop MapReduce, one common question arises: What happens when a record (e.g., a line of text or a JSON object) is split across two HDFS blocks? Imagine a simple scenario where the word "father" is split across two blocks. How do distributed
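As a conceptual sketch (not Hadoop's actual implementation), the rule used by line-oriented readers such as TextInputFormat can be simulated in a few lines: a split's reader skips a leading partial record unless it owns the start of the file, and reads past its own boundary to finish its last record. The tiny 10-byte block size below is an assumption chosen so the cut is visible:

```python
# Simulate two HDFS "blocks" over a small byte stream and show how each
# split's reader assembles only complete records.
data = b"mother\nfather\nsister\n"
BLOCK_SIZE = 10  # cuts the stream as b"mother\nfat" | b"her\nsister\n"

def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the complete records that belong to the split [start, end)."""
    pos = start
    if start != 0:
        # Not the first split: skip the partial record at the beginning; the
        # previous split's reader is responsible for finishing it.
        pos = data.index(b"\n", start) + 1
    records = []
    while pos < end:
        # Finish the current record even if it crosses the split boundary
        # into the next block.
        newline = data.index(b"\n", pos)
        records.append(data[pos:newline])
        pos = newline + 1
    return records

splits = [(0, BLOCK_SIZE), (BLOCK_SIZE, len(data))]
for start, end in splits:
    print((start, end), read_split(data, start, end))
# (0, 10)  -> [b'mother', b'father']  first reader reads past its boundary to finish "father"
# (10, 21) -> [b'sister']             second reader skips the leading partial "her"
```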

How Clients Know Where to Read or Write in HDFS

Hadoop Distributed File System (HDFS) is designed to decouple metadata management from actual data storage. But how does a client—like a Spark job or command-line tool—know where to read or write the bytes of a file across a distributed system? Let’s break down what happens when a client interacts with HDFS. The Role of the
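A rough sketch of that flow, with made-up file names, block IDs, and node names (this mirrors the idea, not the real RPC protocol):

```python
# Conceptual sketch: the client asks the NameNode which blocks make up a file
# and where their replicas live, then streams the bytes directly from DataNodes.

# NameNode side: metadata only, no file content.
namenode_metadata = {
    "/logs/2024/app.log": [
        {"block": "blk_1", "locations": ["dn1", "dn3", "dn4"]},
        {"block": "blk_2", "locations": ["dn2", "dn3", "dn5"]},
    ]
}

# DataNode side: the actual block bytes.
datanode_storage = {
    ("dn1", "blk_1"): b"first part of the file...",
    ("dn2", "blk_2"): b"...remaining bytes of the file",
}

def read_file(path: str) -> bytes:
    """Client-side read: metadata from the NameNode, data from DataNodes."""
    blocks = namenode_metadata[path]        # 1. ask the NameNode for block locations
    content = b""
    for block in blocks:
        for node in block["locations"]:     # 2. try replicas in order
            chunk = datanode_storage.get((node, block["block"]))
            if chunk is not None:
                content += chunk            # 3. stream bytes directly from a DataNode
                break
    return content

print(read_file("/logs/2024/app.log"))
```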

How HDFS Avoids Understanding File Content

One of the defining features of Hadoop Distributed File System (HDFS) is that it doesn’t understand the contents of the files it stores. This is not a limitation—it’s an intentional design choice that makes HDFS flexible, scalable, and efficient for big data workloads. HDFS is content-agnostic: it handles files as byte streams. It doesn’t care
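A minimal sketch of what content-agnostic splitting means in practice: the same byte-offset rule is applied to JSON, CSV, or opaque binary data alike. The 8-byte block size is an assumption chosen so the cuts are easy to see:

```python
BLOCK_SIZE = 8  # illustrative tiny block size so the cuts are visible

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a byte stream into fixed-size blocks without parsing it."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# The same rule applies whether the bytes are JSON, CSV, or opaque binary:
# no record boundaries, field separators, or encodings are ever inspected.
print(split_into_blocks(b'{"word": "father"}'))   # JSON cut mid-value
print(split_into_blocks(b"id,name\n1,father\n"))  # CSV cut mid-row
print(split_into_blocks(bytes(range(20))))        # arbitrary binary, same cuts
```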

How HDFS Tracks Block Size and File Boundaries

When dealing with massive files, Hadoop Distributed File System (HDFS) doesn’t read or store them as a whole. Instead, it splits them into large, fixed-size blocks. But how does it know where each block starts and ends? Let’s dive into how HDFS tracks block size and file boundaries behind the scenes. Fixed block size: each
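As a quick illustration, here is how a file maps onto fixed-size blocks; 128 MB is a common HDFS default for dfs.blocksize, and the 300 MB file size is an assumed example:

```python
# Illustrative sketch of how a file of a given size maps onto fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
file_size = 300 * 1024 * 1024    # a hypothetical 300 MB file

def block_boundaries(size: int, block_size: int = BLOCK_SIZE) -> list[tuple[int, int]]:
    """Return (start_offset, end_offset) pairs for each block of the file."""
    return [(start, min(start + block_size, size))
            for start in range(0, size, block_size)]

for i, (start, end) in enumerate(block_boundaries(file_size)):
    print(f"block {i}: bytes [{start}, {end})  length = {end - start}")
# block 0 and 1 are full 128 MB blocks; block 2 holds the remaining 44 MB.
# The last block only occupies the bytes it actually contains, not a full 128 MB.
```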

How Metadata Works in HDFS and What It Stores

HDFS stores metadata separately from the actual file content to optimize performance and scalability. This metadata is managed entirely by the NameNode, which allows clients to quickly locate and access data blocks across the cluster. What is Metadata in HDFS? Metadata is data about data. In the context of HDFS, it tells the system what
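As an illustrative sketch of the kind of information involved (the field names below are assumptions for illustration, not the actual fsimage layout):

```python
# Sketch of per-file metadata the NameNode keeps: path, ownership, permissions,
# replication factor, block size, and the block list with replica locations.
# Note that the file bytes themselves are never part of this metadata.
from dataclasses import dataclass, field

@dataclass
class BlockInfo:
    block_id: str
    size: int              # bytes actually stored in this block
    locations: list[str]   # DataNodes currently holding a replica

@dataclass
class FileMetadata:
    path: str
    owner: str
    permissions: str
    replication: int
    block_size: int
    blocks: list[BlockInfo] = field(default_factory=list)

meta = FileMetadata(
    path="/data/events.json",
    owner="etl",
    permissions="rw-r--r--",
    replication=3,
    block_size=128 * 1024 * 1024,
    blocks=[
        BlockInfo("blk_1001", 128 * 1024 * 1024, ["dn1", "dn2", "dn4"]),
        BlockInfo("blk_1002", 37 * 1024 * 1024, ["dn2", "dn3", "dn5"]),
    ],
)
print(f"{meta.path}: {len(meta.blocks)} blocks, replication = {meta.replication}")
```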