Geek Logbook

Tech sea log book

The Architecture of HDFS: NameNode, DataNodes, and Metadata

HDFS (Hadoop Distributed File System) was built to provide reliable storage of, and access to, large datasets distributed across commodity hardware. To make this possible, HDFS relies on a master/slave architecture composed of two main types of nodes: the NameNode and the DataNodes. The NameNode, the master, is the brain of HDFS: it holds the filesystem metadata, while the DataNodes hold the actual blocks.
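
A quick way to see this split in practice is the WebHDFS REST API: metadata questions go to the NameNode, while the file bytes themselves are served by DataNodes. The sketch below is a minimal illustration assuming a cluster with WebHDFS enabled; the host, port, and path are hypothetical.

```python
# Minimal sketch: asking the NameNode for metadata over WebHDFS.
# Assumes WebHDFS is enabled and a NameNode is reachable at the
# hypothetical address below; adjust host, port, and path for your cluster.
import requests

NAMENODE = "http://namenode.example.com:9870"   # hypothetical NameNode address
PATH = "/data/events/2024/01/events.log"        # hypothetical HDFS path

# GETFILESTATUS returns metadata only (size, replication, block size, owner);
# the NameNode never serves the file's actual bytes.
status = requests.get(
    f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "GETFILESTATUS"}
).json()["FileStatus"]

print(status["length"], status["replication"], status["blockSize"])

# Block locations are requested separately; the client then reads each block
# directly from the DataNodes that hold it.
```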

What Happens When HDFS Splits Files Mid-Word or Mid-Row?

HDFS is designed to store and process massive amounts of data efficiently. One of its key design decisions is to split files into large, fixed-size blocks, typically 128 MB or 256 MB. But what happens when a file is split right in the middle of a sentence, word, or row? This post will help you understand how HDFS and the frameworks that read from it handle records that straddle block boundaries.
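
The usual answer lives in the input-format layer rather than in HDFS itself: a line-oriented reader skips the partial first line of every split except the first, and reads past the end of its split to finish the last line it started. The following is an illustrative, plain-Python sketch of that convention (not Hadoop code); the file name and split size are made up for the demo.

```python
# Illustrative sketch (not Hadoop code): how a line-oriented record reader
# can process fixed-size splits without ever emitting half a row.
# Convention: every split except the first skips its partial first line,
# and every split reads past its own end to finish the line it started.

def read_records(path, split_start, split_end):
    """Yield the complete lines that belong to the split [split_start, split_end)."""
    with open(path, "rb") as f:
        f.seek(split_start)
        if split_start != 0:
            f.readline()          # discard the partial line; the previous split owns it
        while f.tell() <= split_end:
            line = f.readline()
            if not line:
                break
            yield line.rstrip(b"\n").decode()

# Example: pretend the block/split size is 16 bytes instead of 128 MB.
if __name__ == "__main__":
    with open("demo.txt", "wb") as f:
        f.write(b"first row\nsecond row\nthird row\n")
    size, split = 31, 16
    for start in range(0, size, split):
        print(start, list(read_records("demo.txt", start, min(start + split, size))))
```

Each row comes out exactly once, even though the 16-byte "blocks" cut both the second and third rows in half.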

How HDFS Handles File Partitioning and Block Distribution

One of the key innovations behind the Hadoop Distributed File System (HDFS) is how it breaks down large files and distributes them across multiple machines. This mechanism, called partitioning and block distribution, enables massive scalability and fault tolerance. But how exactly does it work? This post breaks it down clearly so you can understand how files become blocks and how those blocks are spread across the cluster.
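
To make the partitioning concrete, here is a small back-of-the-envelope sketch. The file size, block size, and replication factor are example values (128 MB blocks and a replication factor of 3 are common defaults); note that the final block only occupies as much space as it actually needs.

```python
# Back-of-the-envelope sketch: how a file maps onto HDFS blocks and how
# replication multiplies the number of stored copies. Values are examples.
import math

file_size_bytes  = 300 * 1024**2    # a 300 MiB file (example)
block_size_bytes = 128 * 1024**2    # common default block size: 128 MiB
replication      = 3                # common default replication factor

num_blocks = math.ceil(file_size_bytes / block_size_bytes)
last_block = file_size_bytes - (num_blocks - 1) * block_size_bytes

print(f"blocks: {num_blocks}")                                   # 3
print(f"last block: {last_block / 1024**2:.0f} MiB")             # 44 MiB, not padded to 128
print(f"block replicas stored cluster-wide: {num_blocks * replication}")  # 9
```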

What is HDFS and Why Was It Revolutionary for Big Data?

In the early 2000s, the world was generating data at a scale never seen before: web logs, social media, sensors, and more. Traditional storage systems simply couldn’t keep up with the volume, velocity, and variety of this data. Enter HDFS, the Hadoop Distributed File System, a cornerstone of the Apache Hadoop ecosystem. This blog post explains what HDFS is and why it was revolutionary for big data.

What Is Serialization?

In the world of data engineering and software systems, serialization is a fundamental concept that allows you to efficiently store, transmit, and reconstruct data structures. If you’ve worked with formats like Parquet, Avro, JSON, or CSV, you’ve already interacted with serialization, whether you knew it or not. In this post, we’ll explore what serialization is, why it matters, and how it underpins those common formats.
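
As a minimal illustration using only the Python standard library, the sketch below serializes the same in-memory record two ways, as JSON text and as a Python-specific binary blob, and reconstructs it from both. Formats like Avro or Parquet do the same fundamental job, with schemas, compression, and columnar layouts layered on top.

```python
# Minimal sketch of serialization and deserialization with the standard
# library: one in-memory record is turned into text or bytes and rebuilt.
import json
import pickle

record = {"user_id": 42, "event": "click", "tags": ["ad", "homepage"]}

# Text-based serialization: human-readable, language-neutral
as_json = json.dumps(record)
restored_from_json = json.loads(as_json)

# Binary serialization: compact, but Python-specific
as_pickle = pickle.dumps(record)
restored_from_pickle = pickle.loads(as_pickle)

assert record == restored_from_json == restored_from_pickle
print(as_json)                                   # {"user_id": 42, "event": "click", ...}
print(len(as_pickle), "bytes in the pickled form")
```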

Is S3 the New HDFS? Comparisons and Use Cases in Big Data

Over the past decade, the way organizations store and manage big data has shifted dramatically. Once dominated by the Hadoop Distributed File System (HDFS), the field is now led by Amazon S3 and similar cloud object storage systems. This raises a compelling question in today’s data engineering world: is Amazon S3 the new HDFS? Let’s compare the two and look at where each still fits.
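
One reason the question comes up is that, from a Spark job's point of view, the two often differ only in the path scheme. The snippet below is a hedged sketch assuming a SparkSession with the s3a connector (hadoop-aws) and credentials already configured; the NameNode host, bucket, and paths are hypothetical.

```python
# Hedged sketch: reading the same dataset layout from HDFS and from S3.
# Assumes the s3a connector and credentials are configured; paths are examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-vs-s3").getOrCreate()

df_hdfs = spark.read.parquet("hdfs://namenode:8020/warehouse/events/")
df_s3   = spark.read.parquet("s3a://my-data-lake/warehouse/events/")

# Same DataFrame API either way; the real differences show up in consistency,
# latency, cost, and operations rather than in the reading code.
print(df_hdfs.count(), df_s3.count())
```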

From HDFS to S3: The Evolution of Data Lakes in the Cloud

For years, HDFS (Hadoop Distributed File System) was the default choice for building data lakes in on-premises and Hadoop-based environments. But as cloud computing gained momentum, a new player took the lead: Amazon S3. Today, S3 is widely recognized as the de facto data lake storage layer in the AWS ecosystem. How did this shift happen?

The History and Evolution of Amazon S3: Was It Ever Based on HDFS?

When discussing cloud storage today, Amazon S3 is almost synonymous with scalable, reliable object storage. However, a common question among those familiar with big data technologies like Hadoop is: was Amazon S3 ever based on HDFS (Hadoop Distributed File System)? The short answer is no: Amazon S3 launched before HDFS and was never based on it.

MapReduce: A Framework for Processing Unstructured Data

MapReduce is both a programming model and a framework designed to process massive volumes of data across distributed systems. It gained popularity primarily due to its efficiency in handling unstructured or semi-structured data, especially text. The model splits work into a map phase that emits key/value pairs and a reduce phase that aggregates them, which is exactly why it excels at text processing.
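
To make the model concrete, here is a deliberately simplified, single-machine simulation of the three phases (map, shuffle, reduce) using the classic word-count example. This is plain Python for illustration, not Hadoop's actual API.

```python
# Illustrative, single-machine simulation of the MapReduce model (not Hadoop
# code): map emits key/value pairs, shuffle groups them by key, reduce
# aggregates each group. Word count is the classic text-processing example.
from collections import defaultdict

def map_phase(line):
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```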

Understanding .master() in Apache Spark

In Apache Spark, the .master() method is used to specify how your application will run: either on your local machine or on a cluster. Choosing the correct option for your environment is essential. This post will explain the different .master() options in Spark and when to use them. Local mode, for example, runs the entire application in a single JVM on your machine, which makes it ideal for development and testing.
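
Here is a minimal PySpark example of the method in use. "local[*]" uses all available local cores; the commented-out alternatives show typical cluster values, with the URLs being placeholders. Scripts meant for spark-submit often omit .master() in code and let the --master flag decide instead.

```python
# Minimal PySpark example of setting .master().
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("master-demo")
    .master("local[*]")                     # all local cores, single JVM
    # .master("local[2]")                   # exactly two local threads
    # .master("spark://master-host:7077")   # standalone cluster (placeholder URL)
    # .master("yarn")                       # YARN; cluster details come from the environment
    .getOrCreate()
)

print(spark.sparkContext.master)   # shows which master the session is using
spark.stop()
```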