How Metadata Works in HDFS and What It Stores

By - Geek Logbook
Posted on 2025-05-04
Posted in Notes

HDFS stores metadata separately from the actual file content to optimize performance and scalability. The NameNode manages all of the metadata, while DataNodes hold the blocks. A client asks the NameNode where a file's blocks live and then reads or writes those blocks directly on the DataNodes, so bulk data never passes through the NameNode.

What is Metadata in HDFS?

Metadata is data about data. In HDFS, it records which files and directories exist, which blocks make up each file, where those blocks are stored, and who may access them.

What Metadata Includes

Metadata Type	Details
File and Directory Names	Full paths and hierarchy of stored files and directories
Block Mapping	Block IDs for each file and the DataNodes holding each replica (replica locations are reported by the DataNodes, not persisted in `fsimage`)
File Size	Total size of the file
Block Size	The size of each block (e.g., 128MB or 256MB)
Replication Factor	Number of block replicas (typically 3)
Ownership and Permissions	User/group ownership and UNIX-like permission bits
Modification Timestamps	Tracks changes for consistency and auditing
Quotas	Space and file limits per directory (optional feature)
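
To make the table concrete, here is a minimal sketch that reads the per-file fields WebHDFS returns for a `GETFILESTATUS` request and derives the block count from file size and block size. The JSON field names follow the WebHDFS `FileStatus` object. The sample values and the helper name `summarize_status` are invented for illustration.

```python
import json
import math

# Illustrative response body for GET /webhdfs/v1/<path>?op=GETFILESTATUS.
# Field names follow the WebHDFS FileStatus object; the values are made up.
SAMPLE_RESPONSE = """
{
  "FileStatus": {
    "accessTime": 1746316800000,
    "blockSize": 134217728,
    "group": "analytics",
    "length": 300000000,
    "modificationTime": 1746316800000,
    "owner": "etl",
    "pathSuffix": "",
    "permission": "644",
    "replication": 3,
    "type": "FILE"
  }
}
"""


def summarize_status(response_text):
    """Return a small summary of the metadata fields for one file (hypothetical helper)."""
    status = json.loads(response_text)["FileStatus"]
    # A file occupies ceil(length / blockSize) blocks; the last one may be partial.
    block_count = math.ceil(status["length"] / status["blockSize"])
    return {
        "owner": status["owner"],
        "group": status["group"],
        "permission": status["permission"],
        "replication": status["replication"],
        "block_count": block_count,
        # Raw bytes stored across the cluster, counting every replica.
        "raw_bytes": status["length"] * status["replication"],
    }


summary = summarize_status(SAMPLE_RESPONSE)
print(summary)
```

This response carries no block locations. Those are a separate lookup, for example with `hdfs fsck <path> -files -blocks -locations`.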

How Metadata is Stored

Storage Component	Purpose
`fsimage`	A snapshot of the entire file system metadata, stored on disk
`edits`	A transaction log of all changes made since the last `fsimage` checkpoint
In-Memory Storage	At startup the NameNode loads `fsimage`, replays `edits` on top of it, and keeps the resulting namespace in memory

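During normal operation, the metadata is kept in RAM for fast access, and every change is also appended to `edits` so it survives a restart. Periodically, a checkpoint merges `edits` into `fsimage` to form a new consistent snapshot and keep the log from growing without bound. This is typically done by the Secondary NameNode, or by the Standby NameNode in a high-availability setup.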
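
The snapshot-plus-log idea can be sketched in a few lines. This is a toy model, not the real on-disk format: the dictionary layout, the operation names (`create`, `set_permission`, `delete`), and the function names are all assumptions made for illustration.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged transaction to an in-memory namespace (a plain dict)."""
    op, path = edit["op"], edit["path"]
    if op == "create":
        namespace[path] = {
            "replication": edit["replication"],
            "permission": edit["permission"],
        }
    elif op == "set_permission":
        namespace[path]["permission"] = edit["permission"]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown op: {op}")


def load_namespace(fsimage, edits):
    """Startup: begin from the snapshot, then replay the edits in order."""
    namespace = copy.deepcopy(fsimage)  # leave the stored snapshot untouched
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edits):
    """Merge the edits into a new snapshot and start an empty edit log."""
    return load_namespace(fsimage, edits), []


fsimage = {"/data/a.csv": {"replication": 3, "permission": "644"}}
edits = [
    {"op": "create", "path": "/data/b.csv", "replication": 3, "permission": "644"},
    {"op": "set_permission", "path": "/data/a.csv", "permission": "600"},
    {"op": "delete", "path": "/data/b.csv"},
]

in_memory = load_namespace(fsimage, edits)
new_fsimage, new_edits = checkpoint(fsimage, edits)
print(in_memory)
```

Replaying the log on the old snapshot and loading the new snapshot with an empty log give the same namespace, which is why a checkpoint can safely discard the merged edits.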

Metadata Operations (Examples)

Operation	Metadata Accessed or Updated
File creation	Namespace updated, block mapping initialized
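File read	File-to-block mapping and replica locations returned so the client can read from the DataNodes directly
File delete	File and block records removed from metadata; the replicas are then deleted from the DataNodes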
Permission change	Ownership and access bits updated
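
The operations in the table can be traced with a small in-memory model. `ToyNameNode`, its method names, and the DataNode names `dn1` to `dn3` are hypothetical. The sketch only shows which pieces of metadata each operation touches and is not the real NameNode API.

```python
import itertools


class ToyNameNode:
    """Hypothetical in-memory model of NameNode metadata, for illustration only."""

    def __init__(self):
        self.namespace = {}  # path -> {"owner", "permission", "blocks"}
        self.block_map = {}  # block id -> list of DataNode names
        self._ids = itertools.count(1)

    def create(self, path, owner, datanodes, num_blocks=1):
        # File creation: add a namespace entry and initialize its block mapping.
        block_ids = [next(self._ids) for _ in range(num_blocks)]
        for block_id in block_ids:
            self.block_map[block_id] = list(datanodes)
        self.namespace[path] = {"owner": owner, "permission": "644", "blocks": block_ids}

    def get_block_locations(self, path):
        # File read: return (block id, DataNodes) pairs so the client can fetch data.
        return [(b, self.block_map[b]) for b in self.namespace[path]["blocks"]]

    def set_permission(self, path, permission):
        # Permission change: only the access bits in the namespace entry change.
        self.namespace[path]["permission"] = permission

    def delete(self, path):
        # File delete: drop the namespace entry and its block records.
        entry = self.namespace.pop(path)
        for block_id in entry["blocks"]:
            del self.block_map[block_id]


nn = ToyNameNode()
nn.create("/logs/app.log", owner="etl", datanodes=["dn1", "dn2", "dn3"], num_blocks=2)
locations = nn.get_block_locations("/logs/app.log")
nn.set_permission("/logs/app.log", "600")
permission_after = nn.namespace["/logs/app.log"]["permission"]
nn.delete("/logs/app.log")
print(locations, permission_after, nn.namespace, nn.block_map)
```

None of these calls move file content. Each one reads or updates metadata only, which is what lets the NameNode serve them from memory.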

What Metadata Does Not Include

HDFS metadata does not store:

File content or block data
File format details (e.g., CSV, JSON structure)
Semantic meaning of data (e.g., column names)

That kind of information is handled at the application or processing layer, for example by Hive or Spark.

Summary

HDFS metadata is critical for efficient file management. By separating metadata from file content and serving it from the NameNode's memory, HDFS answers namespace lookups quickly while data transfer scales out across the DataNodes, enabling massive data workloads across distributed clusters.

Tags: HDFS
