Geek Logbook

Tech sea log book

How Clients Know Where to Read or Write in HDFS

Hadoop Distributed File System (HDFS) is designed to decouple metadata management from actual data storage. But how does a client, such as a Spark job or a command-line tool, know where to read or write the bytes of a file across a distributed system? Let's break down what happens when a client interacts with HDFS.
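The read path described above can be sketched in a few lines. This is a simplified stand-in, not the real Hadoop API: the `NameNode`, `DataNode`, and `client_read` names below are illustrative. The key idea is that the client asks the NameNode only for metadata (where the blocks live) and then streams the bytes directly from DataNodes.

```python
class NameNode:
    """Holds metadata only: which DataNodes hold each block of a file."""
    def __init__(self, block_map):
        # path -> ordered list of (block_id, [datanode ids])
        self.block_map = block_map

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Holds the actual block bytes, keyed by block id."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]


def client_read(namenode, datanodes, path):
    data = b""
    # 1. Ask the NameNode WHERE the blocks live (metadata only).
    for block_id, locations in namenode.get_block_locations(path):
        # 2. Fetch the bytes directly from one of the DataNodes.
        data += datanodes[locations[0]].read_block(block_id)
    return data


# Tiny simulated cluster: one file split into two blocks on two DataNodes.
nn = NameNode({"/logs/app.log": [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]})
dns = {"dn1": DataNode({"blk_1": b"hello "}),
       "dn2": DataNode({"blk_2": b"world"})}
print(client_read(nn, dns, "/logs/app.log"))  # b'hello world'
```

Notice that the NameNode never touches file bytes; it only answers the "where" question, which is what keeps it fast and small.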

How HDFS Avoids Understanding File Content

One of the defining features of the Hadoop Distributed File System (HDFS) is that it doesn't understand the contents of the files it stores. This is not a limitation; it's an intentional design choice that makes HDFS flexible, scalable, and efficient for big data workloads. HDFS is content-agnostic: it handles every file as a plain byte stream.
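Content-agnosticism is easy to demonstrate. In the illustrative `chunk` function below (not a Hadoop API), split points depend only on byte offsets, so a CSV file and an arbitrary binary blob of the same length are cut into blocks of exactly the same sizes:

```python
def chunk(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks, ignoring content."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


text = b"id,name\n1,Ada\n2,Linus\n"      # 22 bytes of CSV
binary = bytes(range(22))                # 22 arbitrary bytes

# The block boundaries are identical: only length matters, never content.
assert [len(b) for b in chunk(text, 8)] == [len(b) for b in chunk(binary, 8)]
```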

How HDFS Tracks Block Size and File Boundaries

When dealing with massive files, the Hadoop Distributed File System (HDFS) doesn't read or store them as a whole. Instead, it splits them into large, fixed-size blocks. But how does it know where each block starts and ends? Let's dive into how HDFS tracks block size and file boundaries behind the scenes.
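Because blocks have a fixed size, block boundaries are pure arithmetic over the file length. The `block_layout` helper below is illustrative (not a Hadoop API), but the math mirrors what the NameNode derives: every block except possibly the last is exactly one block size long.

```python
import math

def block_layout(file_size: int, block_size: int = 128 * 1024 * 1024):
    """Return (start_offset, length) for each block of a file."""
    n = math.ceil(file_size / block_size)
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range(n)]


# A 300 MB file with 128 MB blocks: two full blocks plus a 44 MB tail.
MB = 1024 * 1024
layout = block_layout(300 * MB)
print([(start // MB, length // MB) for start, length in layout])
# [(0, 128), (128, 128), (256, 44)]
```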

How Metadata Works in HDFS and What It Stores

HDFS stores metadata separately from the actual file content to optimize performance and scalability. This metadata is managed entirely by the NameNode, which allows clients to quickly locate and access data blocks across the cluster. Metadata is, simply put, data about data: in HDFS, it tells the system what each file looks like structurally, not what it contains.
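A sketch of the kind of per-file record the NameNode keeps makes this concrete. The field names below are illustrative (the real fsimage and edit-log formats differ), but the point stands: the record references blocks by id and never holds file content.

```python
# Illustrative in-memory namespace: path -> file metadata record.
namespace = {
    "/data/events.csv": {
        "replication": 3,
        "block_size": 128 * 1024 * 1024,
        "blocks": ["blk_1001", "blk_1002"],   # ordered block ids, no bytes
        "permissions": "rw-r--r--",
        "owner": "etl",
    }
}

meta = namespace["/data/events.csv"]
# The NameNode can answer "which blocks, in what order?" without
# ever reading a single byte of the file itself.
assert meta["blocks"] == ["blk_1001", "blk_1002"]
```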

The Architecture of HDFS: NameNode, DataNodes, and Metadata

HDFS (Hadoop Distributed File System) was built to support the reliable storage and access of large datasets distributed across commodity hardware. To make this possible, HDFS relies on a master/slave architecture composed of two main types of nodes: the NameNode and the DataNodes. The NameNode, the master, is the brain of HDFS.
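The master/worker split can be sketched through block reports: DataNodes periodically tell the NameNode which blocks they hold, and the NameNode aggregates those reports into its block map. The class and method names here are simplified stand-ins for Hadoop internals, not the real protocol.

```python
from collections import defaultdict

class NameNodeMaster:
    """Master: aggregates block reports; stores locations, never bytes."""
    def __init__(self):
        self.block_locations = defaultdict(set)  # block_id -> {datanode ids}

    def receive_block_report(self, datanode_id, block_ids):
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)


nn = NameNodeMaster()
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2", "blk_3"])

# blk_2 is replicated on two workers; the master knows where, but the
# bytes themselves live only on the DataNodes.
assert nn.block_locations["blk_2"] == {"dn1", "dn2"}
```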

What Happens When HDFS Splits Files Mid-Word or Mid-Row?

HDFS is designed to store and process massive amounts of data efficiently. One of its key design decisions is to split files into large, fixed-size blocks, typically 128 MB or 256 MB. But what happens when a file is split right in the middle of a sentence, word, or row? This post will help you understand how HDFS and its processing frameworks handle exactly that.
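The usual strategy, used for example by Hadoop's LineRecordReader, is simple: a reader for a split skips the (possibly partial) first line, because the previous split's reader will finish it, and reads past its own end to complete the last line. The `read_lines_in_split` function below is an illustrative sketch of that strategy, not the real implementation.

```python
def read_lines_in_split(data: bytes, start: int, length: int):
    """Yield the complete lines 'owned' by the split [start, start+length)."""
    pos = start
    if start > 0:
        # Skip the partial line; the previous split's reader finishes it.
        pos = data.index(b"\n", start) + 1
    end = start + length
    lines = []
    while pos < len(data) and pos < end:
        nl = data.find(b"\n", pos)
        nl = len(data) if nl == -1 else nl
        lines.append(data[pos:nl])   # may read past `end` to finish a line
        pos = nl + 1
    return lines


data = b"row-one\nrow-two\nrow-three\n"
# Cut the file at byte 10, right in the middle of "row-two":
first = read_lines_in_split(data, 0, 10)
second = read_lines_in_split(data, 10, 16)
assert first == [b"row-one", b"row-two"]   # finishes the straddling row
assert second == [b"row-three"]            # skipped the partial "wo\n"
```

Every row is read exactly once, even though the block boundary landed mid-record.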

How HDFS Handles File Partitioning and Block Distribution

One of the key innovations behind the Hadoop Distributed File System (HDFS) is how it breaks down large files and distributes them across multiple machines. This mechanism, called partitioning and block distribution, enables massive scalability and fault tolerance. But how exactly does it work? This post breaks it down clearly so you can understand what happens under the hood.
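A toy placement function shows the shape of block distribution: each block gets `replication` copies spread across distinct DataNodes. Real HDFS placement is rack-aware and load-aware; this round-robin version is only an illustrative sketch.

```python
def place_blocks(num_blocks: int, datanodes: list, replication: int = 3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        # Rotate the starting node so load spreads across the cluster.
        placement[f"blk_{b}"] = [datanodes[(b + r) % len(datanodes)]
                                 for r in range(replication)]
    return placement


plan = place_blocks(4, ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(plan["blk_0"])  # ['dn1', 'dn2', 'dn3']
# Every block has 3 replicas on 3 different nodes, so losing any single
# DataNode never loses data.
assert all(len(set(nodes)) == 3 for nodes in plan.values())
```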