Geek Logbook

Tech sea log book

What is HDFS and Why Was It Revolutionary for Big Data?

In the early 2000s, the world was generating data at a scale never seen before—web logs, social media, sensors, and more. Traditional storage systems simply couldn’t keep up with the volume, velocity, and variety of this data. Enter HDFS: the Hadoop Distributed File System, a cornerstone of the Apache Hadoop ecosystem. This blog post explains

What Is Serialization?

In the world of data engineering and software systems, serialization is a fundamental concept that allows you to efficiently store, transmit, and reconstruct data structures. If you’ve worked with formats like Parquet, Avro, JSON, or CSV, you’ve already interacted with serialization—whether you knew it or not. In this post, we’ll explore: What Is Serialization? Serialization

From HDFS to S3: The Evolution of Data Lakes in the Cloud

For years, HDFS (Hadoop Distributed File System) was the default choice for building data lakes in on-premises and Hadoop-based environments. But as cloud computing gained momentum, a new player took the lead: Amazon S3. Today, S3 is widely recognized as the de facto data lake storage layer in the AWS ecosystem. How did this shift

The History and Evolution of Amazon S3: Was It Ever Based on HDFS?

When discussing cloud storage today, Amazon S3 is almost synonymous with scalable, reliable object storage. However, a common question among those familiar with big data technologies like Hadoop is: Was Amazon S3 ever based on HDFS (Hadoop Distributed File System)? The short answer is: No. Amazon S3: Launched Before HDFS Amazon S3 was officially launched on

OLTP vs. OLAP: How JOINs and Efficiency Shape Their Differences

Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are two distinct database architectures, each designed for different purposes. One key factor that differentiates them is how they handle JOIN operations and the impact these have on query performance. In this post, we’ll explore these differences and why OLAP tends to be more efficient for

The Origins of OLTP and OLAP: A Brief History

Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are fundamental concepts in database management, each serving distinct purposes. But when did these terms first appear, and how did they evolve? Let’s explore their origins and how they became the cornerstone of modern data systems. The Emergence of OLTP The concept of Online Transaction Processing

Enabling Internet Access for Resources in a Public Subnet

When deploying resources in a public subnet within an AWS Virtual Private Cloud (VPC), you need to configure several components to allow them to communicate with the internet. Below are the essential steps: 1. Attach an Internet Gateway (IGW) An Internet Gateway (IGW) enables communication between instances in your VPC and the internet. To set

Network Address Translation (NAT): Overcoming IPv4 Shortages

Network Address Translation (NAT) is a technology designed to mitigate the shortage of IPv4 addresses by allowing multiple devices on a private network to share a limited number of public IP addresses. This process involves translating private IPv4 addresses to public addresses, enabling seamless communication with external networks. Types of NAT: There are three

Why OLTP Systems Don’t Retain Historical Changes

Online Transaction Processing (OLTP) systems are designed for high-speed transactions and efficient data management. However, one of their characteristics is that they do not retain historical changes by default. In this post, we will explore why this happens and provide an example to illustrate the concept. OLTP Systems: Focused on Current Data OLTP databases are

Understanding the Relationship Between Database Replication and the CAP Theorem

Database replication is a fundamental strategy in distributed systems that ensures data is duplicated across multiple nodes. However, when designing a replicated database, one must consider the CAP theorem, which defines the fundamental trade-offs in distributed computing. In this post, we will explore how the CAP theorem applies to database replication and what trade-offs