Geek Logbook

Tech sea log book

Window Functions vs JOIN in Spark: A Physical Plan Perspective

Analytical queries in Spark SQL often admit several correct formulations. Correct, however, does not mean equivalent: different formulations can compile to very different execution plans.

This article compares two approaches to the same problem:

“Find the second highest salary per department, but only in departments with at least two employees.”

We analyze which approach is more efficient and why, from a physical plan standpoint.

Problem Statement

Given table:

Empleados(EmployeeID, Name, DepartmentID, Salary, JoinDate)

Goal:

  • Second highest salary per department
  • Only departments with ≥ 2 employees

Approach 1: CTE + GROUP BY + JOIN

WITH RankedSalaries AS (
    SELECT
        DepartmentID,
        Name,
        Salary,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS ranking
    FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
    SELECT DepartmentID
    FROM Empleados
    GROUP BY DepartmentID
    HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d
  ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2;
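As a sanity check of what this query returns (not of how Spark executes it), the same SQL can be run in SQLite, which has supported these window functions since version 3.25. The sample data below is invented for illustration; note that department 30 has a salary tie at rank 1, so it produces no rank-2 row:

```python
import sqlite3

# Hypothetical sample data. SQLite verifies the query's semantics only;
# it says nothing about Spark's physical plan.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Empleados (
    EmployeeID INTEGER, Name TEXT, DepartmentID INTEGER, Salary INTEGER)""")
conn.executemany(
    "INSERT INTO Empleados VALUES (?, ?, ?, ?)",
    [(1, "Ana", 10, 90000), (2, "Bruno", 10, 80000),
     (3, "Clara", 20, 70000),                        # lone employee: excluded
     (4, "Diego", 30, 60000), (5, "Eva", 30, 60000)])  # tie at rank 1

rows = conn.execute("""
WITH RankedSalaries AS (
    SELECT DepartmentID, Name, Salary,
           RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS ranking
    FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
    SELECT DepartmentID
    FROM Empleados
    GROUP BY DepartmentID
    HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2
""").fetchall()
print(rows)  # [(10, 'Bruno', 80000)]
```

Only Bruno qualifies: department 20 has a single employee, and department 30's tie means no row ever gets rank 2.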

What Happens Physically?

Spark builds two logical branches:

  1. Branch A → Window function (RANK)
  2. Branch B → HashAggregate (COUNT per department)

Then it performs a join.

Observed Physical Characteristics

  • Two scans of the base dataset
  • Two shuffle stages (one per branch)
  • A join (initially SortMergeJoin, optimized by AQE to BroadcastHashJoin)
  • More stages overall

Even if AQE optimizes the join strategy, the structural cost remains higher.

Approach 2: Window-Only Strategy

WITH ResumenEmpleados AS (
    SELECT
        DepartmentID,
        Salary,
        Name,
        COUNT(EmployeeID) OVER (
            PARTITION BY DepartmentID
        ) AS cant_empleado,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS salario_ranking
    FROM Empleados
)
SELECT
    DepartmentID,
    Salary,
    Name
FROM ResumenEmpleados
WHERE cant_empleado > 1
  AND salario_ranking = 2;
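Running the window-only formulation against the same hypothetical SQLite data confirms that both approaches return identical rows; again, this checks semantics only, not the Spark plan:

```python
import sqlite3

# Same hypothetical sample data as for Approach 1.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Empleados (
    EmployeeID INTEGER, Name TEXT, DepartmentID INTEGER, Salary INTEGER)""")
conn.executemany(
    "INSERT INTO Empleados VALUES (?, ?, ?, ?)",
    [(1, "Ana", 10, 90000), (2, "Bruno", 10, 80000),
     (3, "Clara", 20, 70000),
     (4, "Diego", 30, 60000), (5, "Eva", 30, 60000)])

rows = conn.execute("""
WITH ResumenEmpleados AS (
    SELECT DepartmentID, Salary, Name,
           COUNT(EmployeeID) OVER (PARTITION BY DepartmentID) AS cant_empleado,
           RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS salario_ranking
    FROM Empleados
)
SELECT DepartmentID, Salary, Name
FROM ResumenEmpleados
WHERE cant_empleado > 1 AND salario_ranking = 2
""").fetchall()
print(rows)  # [(10, 80000, 'Bruno')]
```

Same single row as Approach 1, computed from one pass over the table.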

What Happens Physically?

Spark:

  • Performs one scan
  • Executes one shuffle by DepartmentID
  • Applies window functions
  • Filters results

There is no separate aggregate branch and no join.

Observed Physical Characteristics

  • One scan
  • One shuffle
  • WindowGroupLimit optimization for Top-K per partition
  • No join

Why Approach 2 Is Generally More Efficient

Performance in Spark is dominated by:

  1. Shuffles
  2. Distributed joins
  3. Sort operations

Approach 2:

  • Eliminates one shuffle
  • Eliminates the join entirely
  • Keeps computation in a single pipeline

In distributed systems, avoiding joins is often more important than reducing SQL verbosity.


Conceptual Lesson

This comparison highlights a deeper principle:

Window functions enrich rows.
GROUP BY reduces rows.
JOIN reconciles datasets.

If your problem is analytical and row-level context must be preserved, window functions are often the most direct and efficient tool.
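The enrich-vs-reduce distinction can be illustrated in plain Python with toy data (no Spark involved):

```python
from collections import Counter

# Toy data: (department, salary). Invented for illustration.
salaries = [("A", 90), ("A", 80), ("B", 70)]

# GROUP BY reduces rows: one output row per department.
counts = Counter(dept for dept, _ in salaries)  # Counter({'A': 2, 'B': 1})

# A window function enriches rows: every input row survives,
# with per-partition context attached to it.
enriched = [(dept, sal, counts[dept]) for dept, sal in salaries]
# [('A', 90, 2), ('A', 80, 2), ('B', 70, 1)]
```

Because the window keeps every row, filters like `salario_ranking = 2` can be applied directly, with no join back to the original rows.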


Important Note About RANK

RANK() can produce multiple rows for rank = 2 if salaries are tied.

If exactly one employee per department is required, use:

ROW_NUMBER() OVER (
    PARTITION BY DepartmentID
    ORDER BY Salary DESC, EmployeeID ASC
)

This ensures deterministic selection.
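The difference is easy to demonstrate in SQLite with a hypothetical department where two employees tie for second place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Empleados (
    EmployeeID INTEGER, Name TEXT, DepartmentID INTEGER, Salary INTEGER)""")
# Bruno and Carla tie at 80000.
conn.executemany(
    "INSERT INTO Empleados VALUES (?, ?, ?, ?)",
    [(1, "Ana", 10, 90000), (2, "Bruno", 10, 80000), (3, "Carla", 10, 80000)])

# RANK: both tied rows share rank 2 -> two rows returned.
rank_rows = conn.execute("""
    SELECT Name FROM (
        SELECT Name,
               RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS r
        FROM Empleados)
    WHERE r = 2 ORDER BY Name""").fetchall()

# ROW_NUMBER with a tiebreaker -> exactly one row, deterministically.
rn_rows = conn.execute("""
    SELECT Name FROM (
        SELECT Name,
               ROW_NUMBER() OVER (
                   PARTITION BY DepartmentID
                   ORDER BY Salary DESC, EmployeeID ASC) AS rn
        FROM Empleados)
    WHERE rn = 2""").fetchall()

print(rank_rows)  # [('Bruno',), ('Carla',)]
print(rn_rows)    # [('Bruno',)]
```

With `ROW_NUMBER`, the `EmployeeID ASC` tiebreaker guarantees the same row is chosen on every run.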


How to Inspect This Yourself

Use:

df.explain("formatted")

Or open Spark UI:

http://localhost:4040

Navigate to:

Jobs → Stages → DAG Visualization

Focus on:

  • Number of Exchange operators
  • Join strategies
  • Number of stages

Key Takeaway

The shorter SQL query is not necessarily faster.
The plan with fewer distributed operations usually is.

In this case, the window-only approach wins because it avoids structural duplication of work.
