How Spark and MapReduce Handle Partial Records in HDFS

When working with large-scale data processing frameworks like Apache Spark or Hadoop MapReduce, one common question arises:

What happens when a record (e.g., a line of text or a JSON object) is split across two HDFS blocks?

Imagine a simple scenario where the word "father" is split across two blocks like this:

  • Block 1: fa
  • Block 2: ther

How do distributed processing systems handle this without corrupting or losing data?

Let’s break it down.

HDFS: Physical Blocks vs Logical Records

HDFS splits files into blocks (typically 128 MB or 256 MB), and these blocks are distributed across multiple DataNodes. However:

  • HDFS blocks are physical divisions of data.
  • Records, like lines in a text file or rows in a CSV, are logical units of data.
  • A single record can span two blocks.

This distinction is essential: while HDFS stores data in blocks, data processing frameworks like Spark and MapReduce operate on logical records.
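
To make the distinction concrete, here is a minimal sketch in Scala that uses Hadoop's FileSystem client to print the block layout of a hypothetical file (the path is made up for illustration). Each block is nothing more than a byte range replicated on some DataNodes; the metadata has no notion of lines or records.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // List the physical blocks of a file. The path is hypothetical; block
  // boundaries are plain byte offsets that ignore record boundaries.
  val conf = new Configuration()
  val fs   = FileSystem.get(conf)
  val path = new Path("hdfs:///data/words.txt")

  val status = fs.getFileStatus(path)
  fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
    println(s"bytes ${block.getOffset} to ${block.getOffset + block.getLength - 1} " +
            s"on DataNodes: ${block.getHosts.mkString(", ")}")
  }

A line that happens to begin a few bytes before one of these offsets and end after it is exactly the fa/ther case from the introduction.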

How MapReduce Handles It

  • Hadoop uses an InputFormat (such as TextInputFormat) to logically split files into InputSplits.
  • Each InputSplit is assigned to a mapper.
  • A RecordReader:
    • Seeks to the beginning of the split.
    • Skips the leading partial record if the split starts in the middle of one (that record belongs to the previous split's reader).
    • Reads full records, even when the tail of a record resides in the next block, possibly on a different DataNode.

If a record begins at the end of one block and continues into the next, the RecordReader will fetch the remaining data from the next block—this includes making a remote read if that block is on another DataNode.
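
The snippet below is a simplified sketch of that policy, not Hadoop's actual LineRecordReader code: it reads a local file through a RandomAccessFile as a stand-in for an HDFS stream, and the hypothetical readSplit function takes split boundaries as byte offsets.

  import java.io.RandomAccessFile
  import scala.collection.mutable.ListBuffer

  // Simplified sketch of the policy a line-oriented record reader follows.
  // `start` and `end` are hypothetical split boundaries (end = start + length).
  def readSplit(path: String, start: Long, end: Long): List[String] = {
    val in = new RandomAccessFile(path, "r")
    try {
      in.seek(start)
      // Rule 1: unless this is the first split, discard the leading partial
      // record -- the reader of the previous split is responsible for it.
      if (start != 0) in.readLine()

      val records = ListBuffer.empty[String]
      var pos = in.getFilePointer
      var eof = false
      // Rule 2: read every record that starts at or before `end`; the last
      // one may run past the split boundary (into the next block) and is
      // still read in full.
      while (pos <= end && !eof) {
        val line = in.readLine()
        if (line == null) eof = true
        else { records += line; pos = in.getFilePointer }
      }
      records.toList
    } finally in.close()
  }

Because every reader after the first skips its leading partial record, and every reader reads one record past its own end, each logical record is produced exactly once no matter where the block boundary falls.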

How Spark Handles It

  • Spark uses Hadoop’s InputFormat system under the hood. For example, sc.textFile("path") uses TextInputFormat.
  • Spark’s scheduler assigns tasks based on InputSplits.
  • Each Spark task reads full records from its assigned split.
  • If a record crosses a block boundary, Spark’s internal reader transparently retrieves the necessary data from the next block.

This behavior ensures that users always work with complete records, without needing to handle block boundaries manually.
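
In practice a plain textFile read is already record-safe. A minimal sketch, assuming a hypothetical file path and search term:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("whole-records").getOrCreate()
  val sc    = spark.sparkContext

  // Each element of `lines` is a complete line, even if that line physically
  // straddles two HDFS blocks on different DataNodes.
  val lines = sc.textFile("hdfs:///data/words.txt")   // backed by TextInputFormat
  println(lines.filter(_.contains("father")).count()) // the word is never seen as "fa" / "ther"

  spark.stop()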

What Happens with Remote Blocks

If a portion of a record lies in a block stored on another DataNode:

  • The HDFS client queries the NameNode for block metadata.
  • It fetches the necessary data from the correct DataNode over the network.
  • The process is transparent to both the developer and the application.

Although this introduces a small I/O overhead, it’s negligible compared to the cost of handling incorrect or partial data.
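
The same transparency applies if you read a byte range by hand through the HDFS client. In the sketch below, the offset is chosen (hypothetically, for a file larger than one block) to start a few bytes before a 128 MB block boundary; the client resolves both blocks' locations through the NameNode and stitches the read together, remotely if needed.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val conf = new Configuration()
  val fs   = FileSystem.get(conf)
  val in   = fs.open(new Path("hdfs:///data/words.txt"))   // hypothetical file > 128 MB

  // Read 64 bytes starting 6 bytes before the first 128 MB block boundary.
  // The client fetches the tail of block 1 and the head of block 2 for us;
  // whether block 2 is local or on another DataNode is invisible here.
  val buffer = new Array[Byte](64)
  in.readFully(128L * 1024 * 1024 - 6, buffer)
  println(new String(buffer, "UTF-8"))
  in.close()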

Summary

  • Records spanning blocks: fully supported.
  • Partial records processed: no; only complete records are read.
  • Remote block reads: handled via the HDFS client and NameNode metadata.
  • Developer effort required: none; handled internally.

Real-World Implication

You don’t need to worry about words or lines being split across blocks like fa/ther. Spark and MapReduce guarantee record-level integrity, even in a distributed environment.

Conclusion

Handling partial records in HDFS is a critical feature for ensuring correctness in distributed data processing. Thanks to the design of Hadoop’s InputFormat and HDFS’s ability to support remote reads, frameworks like Spark and MapReduce can safely and transparently read complete records—even if those records span multiple blocks or nodes.

No manual intervention is required, and developers can focus on writing business logic, confident that their data pipelines are processing whole records correctly.
