Geek Logbook

Tech sea log book

How Metadata Works in HDFS and What It Stores

HDFS stores metadata separately from file content to improve performance and scalability. The NameNode manages this metadata centrally, so a client can ask it where a file's blocks live and then read those blocks directly from the DataNodes across the cluster.

What is Metadata in HDFS?

Metadata is data about data. In the context of HDFS, it tells the system what files exist, where their blocks are stored, and how they should be accessed.

What Metadata Includes

| Metadata Type | Details |
|---|---|
| File and Directory Names | Full paths and hierarchy of stored files and directories |
| Block Mapping | Block IDs for each file and the DataNodes holding each replica (locations are kept in memory and rebuilt from DataNode block reports) |
| File Size | Total size of the file |
| Block Size | The size of each block (e.g., 128 MB or 256 MB) |
| Replication Factor | Number of replicas kept for each block (typically 3) |
| Ownership and Permissions | User/group ownership and UNIX-like permission bits |
| Modification Timestamps | Tracks changes for consistency and auditing |
| Quotas | Space and file limits per directory (optional feature) |
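To make the table concrete, here is a minimal Python sketch of a per-file metadata record. The class and field names are hypothetical and chosen for illustration; they are not the NameNode's actual data structures. It shows how block count follows from file size and block size, and how the replication factor multiplies raw storage.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class FileMetadata:
    """Hypothetical per-file record; names are illustrative, not HDFS internals."""
    path: str
    size_bytes: int
    block_size_bytes: int = 128 * MB
    replication: int = 3
    owner: str = "hdfs"
    group: str = "supergroup"
    permissions: int = 0o644

    def block_count(self) -> int:
        # Ceiling division: the last block may be only partially filled,
        # and an empty file has no blocks at all.
        return (self.size_bytes + self.block_size_bytes - 1) // self.block_size_bytes

    def raw_storage_bytes(self) -> int:
        # Each block is stored `replication` times; a partial last block
        # occupies only its actual length, so the total is size * replication.
        return self.size_bytes * self.replication


meta = FileMetadata(path="/data/events.csv", size_bytes=300 * MB)
print(meta.block_count())               # 3
print(meta.raw_storage_bytes() // MB)   # 900
```

A 300 MB file with 128 MB blocks needs three blocks (128 + 128 + 44 MB), and with three replicas it consumes roughly 900 MB of raw disk across the cluster.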

How Metadata is Stored

| Storage Component | Purpose |
|---|---|
| fsimage | A snapshot of the entire file system metadata, stored on disk |
| edits | A transaction log of all changes made since the last fsimage checkpoint |
| In-Memory Storage | The NameNode loads the combined fsimage + edits into memory at startup |

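During normal operation, the metadata is kept in RAM for fast access, and every change is also appended to the edits log. Periodically, a checkpoint (typically performed by the Secondary NameNode, or by the Standby NameNode in a high-availability setup) merges the accumulated edits into fsimage to form a new consistent snapshot, which keeps the edit log short and speeds up the next restart.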
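The relationship between fsimage and edits can be sketched as a snapshot plus a replayable log. The following toy Python model is only an illustration: the operation names, tuple layout, and dictionary structure are assumptions made for this example and do not reflect the real on-disk formats.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (toy format)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0], "permissions": 0o644}
    elif op == "chmod":
        namespace[path]["permissions"] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edits):
    """Startup: begin from the snapshot, then replay the log in order."""
    namespace = copy.deepcopy(fsimage)  # leave the on-disk snapshot untouched
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edits):
    """Checkpoint: fold the log into a new snapshot and start an empty log."""
    return load_namespace(fsimage, edits), []


fsimage = {"/logs/a.log": {"replication": 3, "permissions": 0o644}}
edits = [
    ("create", "/logs/b.log", 2),
    ("chmod", "/logs/a.log", 0o600),
    ("delete", "/logs/b.log"),
]

in_memory = load_namespace(fsimage, edits)
new_fsimage, new_edits = checkpoint(fsimage, edits)
print(in_memory)
print(len(new_edits))
```

After the checkpoint, the new snapshot equals the in-memory state and the log is empty, so a restart no longer has to replay the three logged changes.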

Metadata Operations (Examples)

| Operation | Metadata Accessed or Updated |
|---|---|
| File creation | Namespace updated, block mapping initialized |
| File read | File-to-block mapping retrieved to direct the client |
| File delete | File and block records removed from metadata |
| Permission change | Ownership and access bits updated |
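The four operations in the table can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: `ToyNameNode` and its methods are hypothetical names, and the round-robin block placement stands in for HDFS's real rack-aware placement policy.

```python
import itertools


class ToyNameNode:
    """Hypothetical, simplified model of NameNode metadata operations."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = min(replication, len(self.datanodes))
        self.namespace = {}        # path -> owner, permissions, block IDs
        self.block_locations = {}  # block ID -> DataNodes holding a replica
        self._block_ids = itertools.count(1)
        self._next_node = 0

    def create(self, path, owner, num_blocks):
        # File creation: add a namespace entry and initialize its block mapping.
        if path in self.namespace:
            raise FileExistsError(path)
        blocks = []
        for _ in range(num_blocks):
            block_id = next(self._block_ids)
            # Round-robin placement (an assumption; real HDFS is rack-aware).
            self.block_locations[block_id] = [
                self.datanodes[(self._next_node + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
            self._next_node += 1
            blocks.append(block_id)
        self.namespace[path] = {"owner": owner, "permissions": 0o644, "blocks": blocks}

    def locate(self, path):
        # File read: return the block list so the client knows which DataNodes to contact.
        return [(b, self.block_locations[b]) for b in self.namespace[path]["blocks"]]

    def chmod(self, path, permissions):
        # Permission change: only the access bits in the namespace entry change.
        self.namespace[path]["permissions"] = permissions

    def delete(self, path):
        # File delete: drop the namespace entry and its block records.
        entry = self.namespace.pop(path)
        for block_id in entry["blocks"]:
            del self.block_locations[block_id]


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create("/data/events.csv", owner="alice", num_blocks=2)
locations = nn.locate("/data/events.csv")
nn.chmod("/data/events.csv", 0o600)
perms_after_chmod = nn.namespace["/data/events.csv"]["permissions"]
nn.delete("/data/events.csv")
print(locations)
```

Note that none of these operations touch file content: the model only ever reads or updates bookkeeping records, which is why the NameNode can serve them from memory.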

What Metadata Does Not Include

HDFS metadata does not store:

  • File content or block data
  • File format details (e.g., CSV, JSON structure)
  • Semantic meaning of data (e.g., column names)

That kind of information is handled at the application or processing layer, such as Hive or Spark.

Summary

HDFS metadata is critical for efficient file management. By separating metadata from file content and managing it centrally through the NameNode, HDFS achieves both speed and scalability, enabling massive data workloads across distributed clusters.
