Geek Logbook

Tech sea log book

From HDFS to S3: The Evolution of Data Lakes in the Cloud

For years, HDFS (Hadoop Distributed File System) was the default choice for building data lakes in on-premises and Hadoop-based environments. But as cloud computing gained momentum, a new player took the lead: Amazon S3. Today, S3 is widely recognized as the de facto data lake storage layer in the AWS ecosystem.

How did this shift happen? Let’s explore the evolution from HDFS to S3 in the context of data lakes.

HDFS: The Original Data Lake Backbone

In the early 2010s, the concept of a data lake—a central repository for storing all types of raw data—became popular. HDFS was a natural fit:

  • Designed for distributed storage across many nodes.
  • Optimized for large, append-only files.
  • A core component of the Hadoop ecosystem used with MapReduce, Hive, and Pig.

As Tom White describes in Hadoop: The Definitive Guide, HDFS enabled organizations to build scalable storage platforms for batch-oriented big data processing.

However, HDFS came with trade-offs:

  • Tight coupling between storage and compute (you needed to maintain the cluster).
  • Scaling required provisioning and managing more hardware.
  • Data access was often limited to Hadoop-compatible tools.
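That tight coupling shows up even in configuration: replication and block layout are cluster-level concerns that operators tune themselves. A minimal `hdfs-site.xml` fragment illustrates this (the values shown are the common defaults, not a recommendation):

```xml
<!-- hdfs-site.xml: replication and block size are managed by the cluster operator -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- default block replication factor -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB blocks, tuned for large sequential reads -->
  </property>
</configuration>
```

Every byte stored is therefore multiplied by the replication factor on hardware you own and maintain—one of the costs S3 later made invisible.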

The Rise of Amazon S3

As cloud adoption grew, AWS offered a new model with Amazon S3:

  • Object storage designed for virtually unlimited scale.
  • Eleven nines (99.999999999%) of durability, with objects stored redundantly across multiple Availability Zones.
  • Decoupled from compute—pay only for the storage you use, with no servers to manage.
  • Seamless integration with AWS services: Athena, Redshift Spectrum, Glue, EMR, and more.

S3 allowed companies to shift away from Hadoop clusters while still storing massive datasets in open formats like Parquet, ORC, and CSV.

Why S3 Became the New Data Lake

S3 won the data lake battle in the cloud for several key reasons:

  • Serverless analytics: Query S3 data directly with tools like Athena.
  • Storage-class options: Lifecycle policies, infrequent access tiers, and archival with Glacier.
  • Ecosystem support: Data can be consumed by AWS, third-party tools, and even on-prem systems.
  • Operational simplicity: No cluster maintenance, no replication config, automatic scaling.
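As a concrete sketch of the serverless model, Athena can define a table directly over files sitting in S3 and query them with standard SQL—no cluster to provision. The bucket path, table name, and columns below are hypothetical:

```sql
-- Hypothetical example: define an external table over Parquet files already in S3
CREATE EXTERNAL TABLE sales_events (
  event_id   string,
  amount     double,
  event_date date
)
STORED AS PARQUET
LOCATION 's3://example-data-lake/sales/';

-- Athena scans the objects directly and charges per data scanned
SELECT event_date, SUM(amount) AS daily_total
FROM sales_events
GROUP BY event_date
ORDER BY event_date;
```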

Today, S3 is often referred to as “the data lake of AWS”—a role that HDFS previously held in the Hadoop world.
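The storage-class options mentioned above are also declarative. A lifecycle configuration like the following (bucket prefix and rule ID are hypothetical) transitions aging objects to cheaper tiers automatically, with no jobs to run:

```json
{
  "Rules": [
    {
      "ID": "tier-down-raw-data",
      "Filter": { "Prefix": "raw/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```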

Key Takeaways

| Feature | HDFS | Amazon S3 |
| --- | --- | --- |
| Type | Distributed file system | Object storage |
| Deployment | On-prem or Hadoop cluster | Cloud-native |
| Scaling | Manual (add nodes) | Automatic |
| Durability | Software-level replication | 99.999999999% across AZs |
| Data Access | Hadoop tools | REST API, SQL engines, Spark |
| Cost Model | Fixed compute + storage | Pay-as-you-go, tiered storage |
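The cost-model row is easiest to see with numbers. The sketch below contrasts a fixed-capacity cluster with pay-per-GB object storage; all prices are made-up round figures for illustration, not real AWS or hardware rates:

```python
# Illustrative comparison of the two cost models.
# All prices are invented round numbers, NOT actual AWS or hardware pricing.

def cluster_monthly_cost(nodes: int, cost_per_node: float) -> float:
    """Fixed model: you pay for every provisioned node, full or empty."""
    return nodes * cost_per_node

def object_storage_monthly_cost(stored_gb: float, price_per_gb: float) -> float:
    """Pay-as-you-go model: cost tracks the data you actually store."""
    return stored_gb * price_per_gb

# A 10-node cluster at an assumed $500/node/month costs the same
# whether it holds 1 TB or 10 TB:
fixed = cluster_monthly_cost(nodes=10, cost_per_node=500.0)

# Object storage at an assumed $0.02/GB scales with usage:
pay_small = object_storage_monthly_cost(stored_gb=1_000, price_per_gb=0.02)   # ~1 TB
pay_large = object_storage_monthly_cost(stored_gb=10_000, price_per_gb=0.02)  # ~10 TB

print(fixed, pay_small, pay_large)
```

The fixed cluster bill stays flat regardless of utilization, while the object-storage bill shrinks and grows with the data—exactly the decoupling described above.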

Final Thoughts

While HDFS laid the foundation for modern data lakes, S3 has redefined the model in the cloud era. Its flexibility, scalability, and native integration with cloud services have made it the go-to choice for data lake architecture in AWS.

As organizations continue to move to the cloud, S3 will likely remain the central storage layer for modern, serverless, and AI-driven analytics.
