How Metadata Works in HDFS and What It Stores
HDFS keeps metadata separate from file content: the NameNode manages the metadata, while DataNodes store the data blocks. Because the NameNode holds this metadata in memory, clients can quickly find out which DataNodes hold the blocks they need.
What is Metadata in HDFS?
Metadata is data about data. In the context of HDFS, it tells the system what files exist, where their blocks are stored, and how they should be accessed.
What Metadata Includes
| Metadata Type | Details |
|---|---|
| File and Directory Names | Full paths and hierarchy of stored files and directories |
| Block Mapping | Block IDs for each file (persisted) and the DataNodes holding each replica (kept in memory only and rebuilt from DataNode block reports) |
| File Size | Total size of the file |
| Block Size | The size of each block, set per file (e.g., 128 MB, the default in Hadoop 2.x and later, or 256 MB) |
| Replication Factor | Number of replicas kept for each block of the file (default 3) |
| Ownership and Permissions | User/group ownership and UNIX-like permission bits |
| Modification Timestamps | Modification and access times recorded for each file, useful for consistency checks and auditing |
| Quotas | Space and file limits per directory (optional feature) |
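
To make the table concrete, here is a minimal Python sketch of a per-file metadata record. The class name `FileMetadata` and its fields are hypothetical and chosen for illustration; they are not HDFS's internal data structures. It shows how file size, block size, and replication factor together determine the block count and the raw cluster storage a file consumes.

```python
from dataclasses import dataclass
from math import ceil

MB = 1024 * 1024


@dataclass
class FileMetadata:
    """Hypothetical per-file record; not HDFS's real internal layout."""
    path: str
    owner: str
    group: str
    permissions: int               # UNIX-style bits, e.g. 0o644
    size_bytes: int
    block_size_bytes: int = 128 * MB
    replication: int = 3

    def block_count(self) -> int:
        # The last block may be only partially filled.
        return ceil(self.size_bytes / self.block_size_bytes)

    def raw_storage_bytes(self) -> int:
        # Every byte of the file is stored once per replica.
        return self.size_bytes * self.replication


meta = FileMetadata("/data/logs/app.log", "alice", "analytics", 0o644,
                    size_bytes=300 * MB)
print(meta.block_count())              # 3
print(meta.raw_storage_bytes() // MB)  # 900
```

A 300 MB file with 128 MB blocks needs three blocks (two full, one partial), and with three replicas it occupies 900 MB across the cluster.
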
How Metadata is Stored
| Storage Component | Purpose |
|---|---|
| fsimage | A snapshot of the entire file system metadata, stored on disk |
| edits | A transaction log of all changes made since the last fsimage checkpoint |
| In-Memory Storage | The NameNode loads the combined fsimage + edits into memory at startup |
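During normal operation, the NameNode serves metadata from RAM for fast access and appends every change to edits. Periodically, a checkpoint merges edits into fsimage to form a new consistent snapshot; this is typically performed by the Secondary NameNode or, in a high-availability setup, the Standby NameNode.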
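
The relationship between fsimage and edits can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the namespace is a plain dictionary, and the operation names (`create`, `delete`, `chmod`) and tuple layout are invented for illustration rather than taken from the real edit log format. It shows that replaying the log on top of the snapshot yields the same state that a checkpoint would write out.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged transaction to a namespace dict (path -> attributes)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0], "permissions": 0o644}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "chmod":
        namespace[path]["permissions"] = args[0]
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edits):
    """Startup: begin from the snapshot, then replay the log in order."""
    namespace = copy.deepcopy(fsimage)  # leave the on-disk snapshot untouched
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edits):
    """Fold the log into a new snapshot; the log then starts empty."""
    return load_namespace(fsimage, edits), []


fsimage = {"/data/a.csv": {"replication": 3, "permissions": 0o644}}
edits = [
    ("create", "/data/b.csv", 3),
    ("chmod", "/data/a.csv", 0o600),
    ("delete", "/data/b.csv"),
]

in_memory = load_namespace(fsimage, edits)
new_fsimage, new_edits = checkpoint(fsimage, edits)
print(in_memory == new_fsimage)  # True
```

Checkpointing keeps the log short, which in turn keeps startup fast: the longer the log, the more transactions must be replayed before the namespace is ready.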
Metadata Operations (Examples)
| Operation | Metadata Accessed or Updated |
|---|---|
| File creation | Namespace updated, block mapping initialized |
| File read | File-to-block mapping retrieved to direct the client |
| File delete | File and block records removed from metadata |
| Permission change | Ownership and access bits updated |
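
The operations in the table can be illustrated with a small in-memory model. The `ToyNameNode` class and its method names are hypothetical and are not the real NameNode API; the sketch also ignores details such as leases, trash, and the fact that a real NameNode learns block locations from DataNode block reports. It only shows which pieces of metadata each operation touches.

```python
from itertools import count


class ToyNameNode:
    """Illustrative in-memory model; not the real NameNode API."""

    def __init__(self):
        self.namespace = {}        # path -> {"permissions": int, "blocks": [ids]}
        self.block_locations = {}  # block_id -> [datanode names]
        self._next_block_id = count(1)

    def create(self, path, permissions=0o644):
        # File creation: add a namespace entry with an empty block list.
        self.namespace[path] = {"permissions": permissions, "blocks": []}

    def add_block(self, path, datanodes):
        # Each new block gets an ID and the DataNodes holding its replicas.
        block_id = next(self._next_block_id)
        self.namespace[path]["blocks"].append(block_id)
        self.block_locations[block_id] = list(datanodes)
        return block_id

    def get_block_locations(self, path):
        # File read: return (block_id, datanodes) pairs in file order.
        return [(b, self.block_locations[b])
                for b in self.namespace[path]["blocks"]]

    def set_permission(self, path, permissions):
        # Permission change: only the access bits are updated.
        self.namespace[path]["permissions"] = permissions

    def delete(self, path):
        # File delete: drop the namespace entry and its block records.
        for block_id in self.namespace.pop(path)["blocks"]:
            del self.block_locations[block_id]


nn = ToyNameNode()
nn.create("/data/a.csv")
nn.add_block("/data/a.csv", ["dn1", "dn2", "dn3"])
nn.add_block("/data/a.csv", ["dn2", "dn3", "dn4"])
print(nn.get_block_locations("/data/a.csv"))
# [(1, ['dn1', 'dn2', 'dn3']), (2, ['dn2', 'dn3', 'dn4'])]
nn.set_permission("/data/a.csv", 0o600)
nn.delete("/data/a.csv")
print(nn.namespace, nn.block_locations)  # {} {}
```

Note that a read returns only block IDs and DataNode locations; the client then fetches the block contents directly from the DataNodes.
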
What Metadata Does Not Include
HDFS metadata does not store:
- File content or block data
- File format details (e.g., CSV, JSON structure)
- Semantic meaning of data (e.g., column names)
That kind of information is handled at the application or processing layer, such as Hive or Spark.
Summary
HDFS metadata is critical for efficient file management. By separating metadata from file content and managing it centrally through the NameNode, HDFS achieves both speed and scalability, enabling massive data workloads across distributed clusters.