Geek Logbook

Tech sea log book

Facebook and Big Data: The Open Source Projects That Changed the Industry

When people talk about the history of Big Data, a few companies come to mind: Google, Yahoo, and Facebook. Each faced challenges that forced it to build large-scale distributed systems. Google introduced foundational concepts such as MapReduce and the Google File System, which later inspired Hadoop. Facebook, for its part, had to handle hundreds of millions (and eventually billions) of users generating massive volumes of social interactions, logs, and media content.

From these challenges, Facebook created several open source projects that went on to shape the entire data ecosystem. Some are still at the core of modern Data Lakehouse architectures.


Hive (2008): SQL on Hadoop

At Facebook, engineers were struggling with hand-written Hadoop MapReduce jobs. Analysts wanted SQL, but Hadoop only offered low-level programming APIs. In 2008, the company open sourced Hive, a system that provided a SQL-like query language (HiveQL) and compiled each query into one or more MapReduce jobs.
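To make that translation concrete, here is a minimal Python sketch of how an aggregate query such as `SELECT country, COUNT(*) FROM page_views GROUP BY country` could be expressed as map, shuffle, and reduce phases. The table, its columns, and the helper functions are hypothetical, and this is not Hive's actual planner; it only illustrates the kind of job that HiveQL saved analysts from writing by hand.

```python
from collections import defaultdict

# Hypothetical rows of a page_views table (illustrative data only).
PAGE_VIEWS = [
    {"user_id": 1, "country": "AR"},
    {"user_id": 2, "country": "US"},
    {"user_id": 3, "country": "AR"},
    {"user_id": 4, "country": "BR"},
    {"user_id": 5, "country": "US"},
    {"user_id": 6, "country": "AR"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate (here COUNT(*)) to each group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows):
    """Rough equivalent of the GROUP BY query described above."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


result = count_by_country(PAGE_VIEWS)
print(result)
```

Even this tiny aggregate needs three functions and explicit plumbing, which is the gap a one-line SQL statement closes.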

Hive made Hadoop accessible to a much larger group of people inside Facebook, since data analysts could finally query petabytes of data with familiar SQL syntax.

Even though MapReduce has been largely replaced as Hive’s execution engine, the Hive Metastore, its metadata catalog, remains critical. Query engines such as Spark and Trino, and table formats such as Iceberg and Delta Lake, can still use it as the source of truth for schemas and table definitions.


Cassandra (2008): Distributed NoSQL Database

Facebook’s inbox search feature demanded a database capable of handling massive amounts of writes and queries across distributed infrastructure. Relational databases weren’t enough. The result was Cassandra (today Apache Cassandra), a distributed NoSQL wide-column store designed for high availability and scalability.
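One way to picture how such a database spreads data is a hash ring: each row's partition key is hashed to a position on a ring, and the next few nodes clockwise hold its replicas. The sketch below is a toy illustration of that idea, not Cassandra's implementation. The node names, the `Ring` class, and the use of MD5 are assumptions made for the example, and real Cassandra uses a different default hash function plus virtual nodes.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string to a position on the ring (MD5 is used only for this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns one token; keys go to the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Return the nodes that would store a copy of this key."""
        start = bisect.bisect_right(self.tokens, token(partition_key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


# Hypothetical four-node cluster with three copies of every row.
ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners)
```

Because every key has several owners and no node is special, a write can still succeed when one replica is down, which is the availability property the inbox search workload needed.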

Although Facebook later chose HBase over Cassandra for its new messaging platform, the project grew into a widely adopted database across the industry, powering large-scale applications at companies like Netflix and Uber, and at Instagram, which Facebook later acquired.


Presto (2012): Interactive SQL at Scale

By 2012, Hive had become too slow for the interactive work Facebook’s analysts wanted to do. Because Hive compiled each query into batch MapReduce jobs, even simple queries could take minutes or hours. To solve this, Facebook began developing Presto, a distributed SQL query engine optimized for interactive analytics, and open sourced it in 2013.

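Presto allowed users to query data stored not only in Hadoop but also in MySQL, Cassandra, and other sources, all through a single SQL interface. It processed queries in memory instead of through MapReduce stages, and it was federated: one query could join data from several systems.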
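The federation idea can be sketched in a few lines of Python. In the example below, two independent in-memory SQLite databases stand in for separate systems, a `scan` helper plays the role of a connector, and `clicks_per_user` plays the role of the engine that joins rows from both sources in memory. All of these names and tables are hypothetical, and the sketch says nothing about how Presto plans or distributes a query.

```python
import sqlite3

# Two independent databases stand in for separate systems
# (for example a MySQL server and a Hadoop table); both are hypothetical.
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "bob")])

events_db = sqlite3.connect(":memory:")
events_db.execute("CREATE TABLE clicks (user_id INTEGER, url TEXT)")
events_db.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/feed"), (2, "/home")],
)


def scan(connection, query):
    """Connector role: run a query against one source and return its rows."""
    return connection.execute(query).fetchall()


def clicks_per_user():
    """Engine role: fetch rows from both sources and join them in memory."""
    names = dict(scan(users_db, "SELECT id, name FROM users"))
    counts = {}
    for user_id, _url in scan(events_db, "SELECT user_id, url FROM clicks"):
        name = names[user_id]
        counts[name] = counts.get(name, 0) + 1
    return counts


report = clicks_per_user()
print(report)
```

In a federated engine the user would write this as one SQL join across two catalogs, and the engine would do the per-source scans and the in-memory join on their behalf.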

Later, the original creators of Presto left Facebook and continued the project as PrestoSQL, which was renamed Trino in 2020. Trino has become one of the most widely used open source query engines for modern Data Lakehouse architectures.


RocksDB (2013): High-Performance Embedded Storage

As Facebook pushed the limits of online services and mobile apps, it needed a high-performance storage engine that could serve as the foundation for applications like Messenger and News Feed ranking. This led to RocksDB, an embedded, persistent key-value store that began as a fork of Google’s LevelDB.
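RocksDB, like LevelDB, is organized as a log-structured merge-tree: writes land in an in-memory table that is periodically frozen into sorted, immutable runs, and reads check the newest data first. The toy class below sketches only that read and write path. `TinyLSM` and its parameters are hypothetical, everything stays in memory, and persistence, compaction, and deletes are left out, so treat it as an illustration of the idea and not of RocksDB's API.

```python
class TinyLSM:
    """Toy log-structured store: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.runs = []  # newest run last; each run is a sorted list of (key, value)

    def put(self, key, value):
        """Write to the memtable and flush it once it reaches the size limit."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, immutable run."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            # A linear scan keeps the sketch short; a real engine would use an index.
            for run_key, value in run:
                if run_key == key:
                    return value
        return None


store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)  # reaches the limit, so the memtable is flushed to a run
store.put("a", 3)  # the newer value for "a" lives in the memtable
print(store.get("a"), store.get("b"), store.get("missing"))
```

Writes never modify an existing run, which is what makes this layout friendly to write-heavy workloads.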

RocksDB became an essential component not only for Facebook but also for other systems that need fast local storage: stream processors such as Kafka Streams and Flink use it to hold local state, and MyRocks is a RocksDB-based storage engine for MySQL.


Legacy and Impact

Facebook’s open source contributions were not just about solving internal problems. They reshaped the entire data landscape:

  • Hive helped make SQL the common language of Big Data.
  • Cassandra popularized highly available, distributed NoSQL databases.
  • Presto/Trino set the standard for federated, interactive SQL engines.
  • RocksDB enabled a new generation of low-latency applications.

Today, when we talk about Data Lakehouses, we are still relying on concepts and tools born out of Facebook’s scale. Even if the infrastructure has shifted from HDFS to object storage like S3 and MinIO, the DNA of modern analytics systems can be traced back to these innovations.


Conclusion

Facebook’s data challenges created solutions that became industry standards. The company’s decision to open source Hive, Cassandra, Presto, and RocksDB ensured that the wider ecosystem could benefit, and those projects continue to power some of the most advanced analytics platforms today.

In many ways, the Lakehouse era is built on foundations laid by Facebook’s engineers more than a decade ago.
