The Architecture of HDFS: NameNode, DataNodes, and Metadata
HDFS (Hadoop Distributed File System) was built to support the reliable storage and access of large datasets distributed across commodity hardware. To make this possible, HDFS relies on a master/slave architecture composed of two main types of nodes: the NameNode and the DataNodes.
1. The NameNode (Master)
The NameNode is the brain of HDFS. It manages:
- Metadata: Keeps track of the filesystem namespace (file names, directories, permissions).
- Block mapping: Records which DataNode holds which block of a file.
- Cluster status: Monitors the health of DataNodes through regular heartbeats.
The NameNode does not store file data itself; it holds only metadata about that data.
Example:
If you store a 300 MB file with a 128 MB block size, the NameNode will know:
- The file is divided into 3 blocks (128 MB, 128 MB, and 44 MB).
- The first replica of Block 1 is on DataNode A, of Block 2 on B, and of Block 3 on C.
- Each block has 3 replicas in total (e.g., Block 1 on A, D, and F).
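The block count in this example follows from simple division. The sketch below is only an illustration of that arithmetic; `split_into_blocks` is a hypothetical helper, not part of any HDFS API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 128 * MB) -> list[int]:
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


sizes = split_into_blocks(300 * MB)
print([s // MB for s in sizes])  # [128, 128, 44]
```

The final block is smaller than the configured block size: it occupies only the 44 MB it actually contains.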
2. The DataNodes (Workers)
DataNodes are responsible for:
- Storing the actual data blocks.
- Serving read/write requests from HDFS clients.
- Sending heartbeats and block reports to the NameNode.
DataNodes have no knowledge of files or directories; each one knows only that it holds blocks identified by block IDs.
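To make the heartbeat mechanism concrete, here is a minimal sketch of how a master could track liveness from heartbeat timestamps. The class name `HeartbeatMonitor` and the 30-second timeout are assumptions chosen for illustration; they are not HDFS classes or HDFS defaults.

```python
class HeartbeatMonitor:
    """Toy liveness tracker: a node is considered dead once its last
    heartbeat is older than timeout_seconds (an illustrative value)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: dict[str, float] = {}

    def record_heartbeat(self, datanode_id: str, now: float) -> None:
        # Remember the most recent time this DataNode reported in.
        self.last_seen[datanode_id] = now

    def dead_nodes(self, now: float) -> list[str]:
        # Any node whose last heartbeat is too old is reported as dead.
        return sorted(
            node_id
            for node_id, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.record_heartbeat("datanode-a", now=0.0)
monitor.record_heartbeat("datanode-b", now=0.0)
monitor.record_heartbeat("datanode-a", now=25.0)
print(monitor.dead_nodes(now=40.0))  # ['datanode-b']
```

Timestamps are passed in explicitly rather than read from the system clock so the example stays deterministic.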
3. The Client
The client interacts with both:
- The NameNode, to get metadata (e.g., block locations).
- The DataNodes, to read/write file blocks directly.
Because file data never passes through the NameNode, this design keeps the NameNode's load low and allows high-throughput data transfer.
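The two-step read path (metadata from the NameNode, bytes from the DataNodes) can be sketched with in-memory stand-ins. `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names invented for this illustration; the real client talks to the cluster over the network and is considerably more involved.

```python
class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where replicas live."""

    def __init__(self) -> None:
        # path -> ordered list of (block_id, [DataNode ids holding a replica])
        self.block_map: dict[str, list[tuple[str, list[str]]]] = {}

    def get_block_locations(self, path: str) -> list[tuple[str, list[str]]]:
        return self.block_map[path]


class ToyDataNode:
    """Holds block bytes keyed by block ID; knows nothing about file names."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def read_block(self, block_id: str) -> bytes:
        return self.blocks[block_id]


def read_file(path: str, namenode: ToyNameNode,
              datanodes: dict[str, ToyDataNode]) -> bytes:
    data = bytearray()
    # Step 1: ask the NameNode where the blocks are (metadata only).
    for block_id, replica_nodes in namenode.get_block_locations(path):
        # Step 2: fetch each block directly from the first reachable replica.
        for node_id in replica_nodes:
            node = datanodes.get(node_id)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no reachable replica for block {block_id}")
    return bytes(data)


namenode = ToyNameNode()
datanodes = {"A": ToyDataNode(), "B": ToyDataNode()}
datanodes["A"].blocks["blk_1"] = b"hello "
datanodes["B"].blocks["blk_1"] = b"hello "
datanodes["B"].blocks["blk_2"] = b"world"
namenode.block_map["/demo.txt"] = [("blk_1", ["A", "B"]), ("blk_2", ["B"])]

print(read_file("/demo.txt", namenode, datanodes))  # b'hello world'
```

Note that `ToyNameNode` never touches the block contents, which mirrors the division of labor described above.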
4. Block-Based Storage
Files in HDFS are split into large blocks (128 MB by default in Hadoop 2 and later; 256 MB is also common). These blocks are:
- Stored independently.
- Distributed across multiple DataNodes.
- Replicated (default replication factor is 3) for fault tolerance.
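As a rough illustration of why replication gives fault tolerance, the sketch below spreads replicas across nodes with a simple round-robin rule. This rule is an assumption made for the example: HDFS's actual default placement policy is rack-aware and more sophisticated. `place_replicas` is a hypothetical function.

```python
def place_replicas(num_blocks: int, datanodes: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_index in range(num_blocks):
        # Consecutive nodes starting at an offset that shifts per block,
        # so replicas of one block never share a node.
        placement[block_index] = [
            datanodes[(block_index + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


placement = place_replicas(3, ["A", "B", "C", "D", "E"])
print(placement)
# {0: ['A', 'B', 'C'], 1: ['B', 'C', 'D'], 2: ['C', 'D', 'E']}

# If DataNode C fails, every block still has two surviving replicas.
survivors = {b: [n for n in nodes if n != "C"] for b, nodes in placement.items()}
print(survivors)
# {0: ['A', 'B'], 1: ['B', 'D'], 2: ['D', 'E']}
```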
5. How Metadata Is Stored
The NameNode stores metadata in memory for fast access, and persists it to disk in:
- A namespace image (fsimage): a snapshot of the filesystem.
- An edit log (edits): a transaction log of recent changes.
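On restart, the NameNode loads the fsimage and replays the edit log on top of it to restore the namespace. Block locations are not persisted; they are rebuilt from the block reports that DataNodes send.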
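The snapshot-plus-log recovery idea can be sketched in a few lines. Everything here is a simplification for illustration: the dictionary layout, the three operation names, and `restore_namespace` are assumptions, and the real fsimage and edit log are on-disk formats with many more operation types.

```python
import copy


def restore_namespace(fsimage: dict[str, dict], edits: list[tuple]) -> dict[str, dict]:
    """Rebuild the namespace by replaying logged edits over a snapshot."""
    namespace = copy.deepcopy(fsimage)  # leave the snapshot itself untouched
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = {"replication": args[0]}
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return namespace


# Toy snapshot and log (hypothetical paths).
fsimage = {"/logs/a.txt": {"replication": 3}}
edits = [
    ("create", "/logs/b.txt", 2),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
    ("delete", "/logs/b.txt"),
]

print(restore_namespace(fsimage, edits))
# {'/archive/a.txt': {'replication': 3}}
```

Replaying the log in order matters: applying the same edits in a different sequence could yield a different namespace.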
Summary of HDFS Architecture
| Component | Role |
|---|---|
| NameNode | Stores filesystem metadata and block locations; coordinates the cluster |
| DataNodes | Store actual file data (blocks) |
| Client | Gets metadata from the NameNode; reads/writes blocks directly on DataNodes |
This architecture allows HDFS to scale horizontally and handle very large volumes of data reliably, even in the face of hardware failures.