Geek Logbook

Tech sea log book

Batch Means Two Different Things: Why the Term Became Confusing in Data Engineering

In data systems, some of the most common words are also the most overloaded. Few terms illustrate this better than batch.

Historically, batch processing described a very specific operating model: work was accumulated, grouped, and executed later, usually without direct user interaction. In contrast, online systems handled requests interactively, responding as operations arrived. This distinction shaped the vocabulary behind terms such as OLTP (Online Transaction Processing) and, later, OLAP (Online Analytical Processing).

But in modern data engineering, the word batch is used differently. Today, engineers often say they run “batch jobs in Spark” to extract data from transactional systems, transform it, and load it into analytical platforms. In that setting, batch no longer refers to the user-facing behavior of the original system. It refers to the execution model of the data pipeline.

That shift creates confusion. The term did not disappear; it started operating at two different levels.

The Original Meaning of Batch

In its historical sense, batch described a system that did not process each request immediately as it arrived from a user or terminal. Instead, work was collected over a period of time and processed later as a group.

A payroll run is a classic example. Employee records and worked hours might be gathered all week, but salaries would be calculated in a scheduled run at the end of the period. The key characteristics were delay, accumulation, and grouped execution.

In that world, the contrast with online was clear:

  • batch meant deferred processing
  • online meant interactive processing
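
The contrast can be sketched in a few lines of Python. This is only an illustration of the two interaction models: the class names OnlineLedger and BatchLedger and their methods are hypothetical and do not come from any real system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OnlineLedger:
    """Interactive model: each request is processed as it arrives."""
    balance: int = 0

    def handle(self, amount: int) -> int:
        self.balance += amount
        return self.balance  # the caller gets the result immediately


@dataclass
class BatchLedger:
    """Deferred model: requests accumulate and are processed later as a group."""
    balance: int = 0
    pending: List[int] = field(default_factory=list)

    def submit(self, amount: int) -> None:
        self.pending.append(amount)  # accepted, but not processed yet

    def run_batch(self) -> int:
        self.balance += sum(self.pending)  # grouped execution
        self.pending.clear()
        return self.balance


online = OnlineLedger()
online_results = [online.handle(a) for a in (10, 20, 30)]

batch = BatchLedger()
for a in (10, 20, 30):
    batch.submit(a)
balance_before_run = batch.balance   # still 0: work was only accumulated
balance_after_run = batch.run_batch()
```

Both ledgers end with the same balance. What differs is when the work happens: the online version answers every request on arrival, while the batch version shows no effect until the scheduled run.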

This is the conceptual background behind Online Transaction Processing. The word online did not originally mean “through the internet.” It meant that the system was connected to active operations and responded directly to them, rather than waiting for a later batch cycle.

Where the Meaning Starts to Shift

Modern data architectures introduced a new context.

Today, many systems are operationally OLTP at the source. They register orders, payments, account updates, or customer events in real time. These are still online transactional systems.

However, downstream from those systems, data teams often build analytical environments through scheduled jobs:

  • extract data from OLTP databases
  • clean and standardize records
  • join and enrich datasets
  • load fact and dimension tables
  • publish data to a warehouse, lakehouse, or marts

These jobs are frequently described as batch pipelines.
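
As a rough sketch of such a job, the steps listed above might look like the following. It uses Python's built-in sqlite3 module as a stand-in for both the OLTP source and the analytical target. The table names (orders, customers, fact_orders) and the function run_batch_etl are assumptions made for illustration; a real pipeline would more likely run in Spark or a similar engine.

```python
import sqlite3


def run_batch_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """One scheduled run: extract, join, clean, load. Returns rows loaded."""
    # Extract and join: read orders together with their customer's country.
    rows = source.execute(
        "SELECT o.id, c.country, o.amount "
        "FROM orders AS o JOIN customers AS c ON c.id = o.customer_id"
    ).fetchall()

    # Clean and standardize: drop missing amounts, normalize country codes.
    cleaned = [
        (order_id, country.strip().upper(), amount)
        for order_id, country, amount in rows
        if amount is not None
    ]

    # Load: rebuild the fact table from this run's result (full refresh).
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    warehouse.execute("DELETE FROM fact_orders")
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
    warehouse.commit()
    return len(cleaned)


# Hypothetical sample data standing in for a live transactional database.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, ' ar '), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, NULL), (12, 2, 40.0);
""")
warehouse = sqlite3.connect(":memory:")

loaded = run_batch_etl(source, warehouse)
totals = warehouse.execute(
    "SELECT country, SUM(amount) FROM fact_orders GROUP BY country ORDER BY country"
).fetchall()
```

The source tables keep accepting transactions as usual. Only the job that copies and reshapes the data runs as a bounded, grouped unit of work.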

At first glance, that seems contradictory. If the source is online and transactional, why are engineers also calling the process batch?

The answer is that the word now refers to something else.

Batch Now Operates at Two Levels

The cleanest way to understand the ambiguity is to distinguish two levels of meaning.

1. Batch as a property of the system’s interaction model

This is the historical meaning.

Here, batch describes how the system behaves relative to incoming work. The system accumulates tasks and processes them later. It is contrasted with an online system that serves requests interactively.

This is a system-level distinction.

2. Batch as a property of the pipeline’s execution model

This is the modern data engineering meaning.

Here, batch describes how a pipeline runs internally. A Spark job may execute every hour, every night, or every morning. It may process a full table, a partition, or a bounded slice of change data. In this case, batch says nothing directly about whether the source application is OLTP or OLAP. It only describes how the data movement or transformation is scheduled and executed.
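
To make "a bounded slice" concrete, here is a minimal sketch of how an hourly run might select its input, assuming each record carries an updated_at timestamp. The helper names hourly_window and bounded_slice, and the record layout, are hypothetical.

```python
from datetime import datetime, timedelta


def hourly_window(run_time: datetime):
    """Return the [start, end) bounds of the last complete hour before run_time."""
    end = run_time.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def bounded_slice(records, window_start, window_end):
    """Keep only records with window_start <= updated_at < window_end."""
    return [r for r in records if window_start <= r["updated_at"] < window_end]


records = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 8, 59)},
    {"id": 2, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 1, 9, 30)},
    {"id": 4, "updated_at": datetime(2024, 1, 1, 10, 0)},
]

# A run triggered at 10:05 processes exactly the 09:00-10:00 slice.
start, end = hourly_window(datetime(2024, 1, 1, 10, 5))
selected_ids = [r["id"] for r in bounded_slice(records, start, end)]
```

The half-open interval means consecutive runs neither overlap nor leave gaps, which is one common way to keep each batch bounded.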

This is a pipeline-level distinction.

Once these two levels are separated, the apparent contradiction disappears.

Why the Confusion Matters

The confusion matters because people often use the same word while referring to different architectural objects.

Consider the statement:

“Batch processing takes data from OLTP systems and loads it into OLAP.”

This is correct, but only if the terms are understood precisely:

  • OLTP describes the purpose and workload of the source system
  • OLAP describes the purpose and workload of the target system
  • batch describes the execution style of the data pipeline connecting them

The mistake is to treat batch and OLTP as if they were always competing labels in the same category. They are not.

One classifies the kind of workload a system serves.
The other classifies the way data processing is carried out.

A Better Mental Model

A more precise way to think about modern data architectures is to use two separate axes.

Axis 1: Workload type

  • OLTP: optimized for transactions, consistency, and operational updates
  • OLAP: optimized for analytical queries, aggregation, and exploration

Axis 2: Processing mode

  • batch: bounded, scheduled, grouped execution
  • streaming: continuous or event-driven processing
  • micro-batch: small repeated batches that approximate low-latency processing
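
The three modes can be compared on the same input. The sketch below computes a running total over a short list of numbers in each style. It is a toy model with hypothetical function names, not a representation of how any particular engine implements these modes.

```python
from itertools import islice


def process_batch(events):
    """Batch: the whole bounded input is processed in one grouped run."""
    return [sum(events)]


def process_streaming(events):
    """Streaming: an updated result is emitted after every event."""
    total = 0
    for event in events:
        total += event
        yield total


def process_micro_batch(events, size):
    """Micro-batch: an updated result is emitted after each small group."""
    iterator = iter(events)
    total = 0
    while chunk := list(islice(iterator, size)):
        total += sum(chunk)
        yield total


events = [1, 2, 3, 4, 5]
batch_output = process_batch(events)
streaming_output = list(process_streaming(events))
micro_batch_output = list(process_micro_batch(events, size=2))
```

All three arrive at the same final total. They differ in how often intermediate results become visible, which is the latency trade-off the three labels describe.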

With that model, the following combinations make sense:

  • OLTP source + batch ETL + OLAP warehouse
  • OLTP source + CDC (change data capture) streaming + OLAP serving layer
  • OLTP source + micro-batch transformations + near-real-time analytics

There is no contradiction because the terms belong to different dimensions.
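
One way to see the independence of the two axes is to model them as separate types. The following is a small sketch in which the names Workload, ProcessingMode, and Architecture are purely illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    """Axis 1: what kind of work a system serves."""
    OLTP = "OLTP"
    OLAP = "OLAP"


class ProcessingMode(Enum):
    """Axis 2: how the pipeline between systems executes."""
    BATCH = "batch"
    STREAMING = "streaming"
    MICRO_BATCH = "micro-batch"


@dataclass(frozen=True)
class Architecture:
    source: Workload
    pipeline: ProcessingMode
    target: Workload

    def describe(self) -> str:
        return (
            f"{self.source.value} source + {self.pipeline.value} pipeline"
            f" + {self.target.value} target"
        )


# Every processing mode can connect the same pair of workloads.
combinations = [
    Architecture(Workload.OLTP, mode, Workload.OLAP) for mode in ProcessingMode
]
descriptions = [c.describe() for c in combinations]
```

Because the workload and the processing mode are different fields with different types, "batch" can never be assigned where "OLTP" is expected, which mirrors the point that the two labels do not compete.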

What Actually Changed

The meaning of batch did not become wrong. It broadened.

Historically, it described how a system behaved toward incoming work: whether it responded immediately or deferred the work to a later run.
In modern practice, it often describes the execution pattern of the engineering layer that moves and reshapes data.

In other words, batch changed its object.

Before, it mostly described the behavior of the primary processing system.
Now, it often describes the behavior of the data pipeline surrounding that system.

That is why the term feels less sharp today. The word survived, but the architectural level at which it is applied shifted.

Conclusion

The confusion around batch in data engineering comes from a semantic migration, not from a conceptual failure.

In the historical vocabulary of computing, batch was the opposite of online: deferred and grouped processing versus immediate and interactive processing.

In modern data engineering, batch often refers to the way a pipeline executes, even when the source system is fully online and transactional.

So yes, the old opposition becomes blurred, but only because the term now operates at two levels:

  • as a description of how a system handles work
  • as a description of how a pipeline processes data

Once those levels are separated, the terminology becomes much more stable.

A useful summary is this:

OLTP and OLAP classify systems by purpose. Batch and streaming classify data processing by execution mode.

That distinction removes a large amount of ambiguity in discussions about modern data architecture.