Geek Logbook

Tech sea log book

EMR vs AWS Glue: Choosing the Right Data Processing Tool on AWS

When working with big data on AWS, two commonly used services for data processing are Amazon EMR and AWS Glue. Although both support scalable data transformation and analytics, they differ significantly in architecture, control, use cases, and cost models. Choosing the right tool depends on your specific workload, performance needs, and operational preferences. In this …

How Google Changed Big Data: The Story of GFS, MapReduce, and Bigtable

In the early 2000s, Google faced a unique challenge: how to store, process, and query massive amounts of data across thousands of unreliable machines. The traditional systems of the time—designed for a world of smaller datasets and centralized infrastructure—simply couldn’t keep up. Google responded by designing an entirely new architecture. It wasn’t just about solving …

Secure Database Access in AWS Using SSH Tunneling

Accessing databases located in private subnets within AWS Virtual Private Clouds (VPCs) is a common requirement in enterprise architectures. To ensure secure connectivity without exposing the database to the public internet, developers and operations engineers often employ SSH tunneling via a bastion host. Background: Databases in a private subnet cannot be accessed directly from external …
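As a rough sketch of the tunneling approach this excerpt describes, the snippet below builds the SSH local-port-forwarding command for reaching a private database through a bastion. All host names, ports, and the local port are placeholders, not values from the article.

```python
# Sketch: construct an SSH local port-forwarding command so a client can
# reach a database in a private subnet via a bastion host.
# Host names and ports below are illustrative placeholders.

def tunnel_command(bastion, db_host, db_port, local_port):
    """Return the argv list for `ssh -N -L local_port:db_host:db_port bastion`."""
    return [
        "ssh", "-N",                              # -N: forward ports only, no shell
        "-L", f"{local_port}:{db_host}:{db_port}",  # local -> private DB endpoint
        bastion,
    ]

cmd = tunnel_command("ec2-user@bastion.example.com",
                     "mydb.internal", 5432, 5433)
# With the tunnel running, the client connects to localhost:5433
# as if it were the private database.
```

In practice this argv list would be passed to `subprocess.Popen` (or the equivalent command run directly in a terminal), and the database client is then pointed at the chosen local port.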

How Network Topology Shapes Distributed Computing and Big Data Systems

When discussing distributed systems and Big Data, people often focus on storage, processing frameworks, and scalability—but one foundational concept underlies it all: network topology. It’s the invisible architecture that dictates how data flows, how quickly systems respond, and how resilient your applications can be. Let’s explore what network topology is, how it evolved, and why …

What Is Sharding and Why It Matters

As our world becomes increasingly digital, the amount of data we create every day is staggering. Think about all the emails, messages, orders, and photos uploaded every second. How do big companies manage and store so much information efficiently? One of the key techniques they use is called sharding. What Is Sharding? Sharding is a …

From Tables to Partitions: Designing NoSQL Databases with Cassandra

As data professionals transition from relational databases to NoSQL systems like Apache Cassandra, one of the most important mindset shifts is understanding that you don’t model data for storage, but for queries. This departure from the familiar world of third normal form (3NF) requires not only technical adjustments but also a new way of thinking …

Apache Cassandra vs Apache Parquet: Understanding the Differences

In modern data architectures, it’s common to encounter both Apache Cassandra and Apache Parquet, particularly when dealing with large-scale, distributed systems. Both technologies are associated with columnar data models, which often leads to confusion. However, Cassandra and Parquet serve fundamentally different purposes and operate at different layers of the data stack. This article clarifies their …

How Dynamo Reshaped the Internal Architecture of Amazon S3

Amazon S3 launched in 2006 as a scalable, durable object storage system. It avoided hierarchical file systems and used flat key-based addressing from day one. However, early versions of S3 ran into architectural challenges—especially in metadata consistency and fault tolerance. Meanwhile, another internal team at Amazon was building Dynamo, a distributed key-value store optimized for …