Geek Logbook

Tech sea log book

Window Functions vs JOIN in Spark: A Physical Plan Perspective

Analytical queries in Spark SQL often admit several correct formulations. Correct, however, does not mean equivalent: different formulations can compile to very different execution plans.

This article compares two approaches to the same problem:

“Find the second highest salary per department, but only in departments with at least two employees.”

We analyze which approach is more efficient and why, from a physical plan standpoint.

Problem Statement

Given table:

Empleados(EmployeeID, Name, DepartmentID, Salary, JoinDate)

Goal:

  • Second highest salary per department
  • Only departments with ≥ 2 employees

Approach 1: CTE + GROUP BY + JOIN

WITH RankedSalaries AS (
    SELECT
        DepartmentID,
        Name,
        Salary,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS ranking
    FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
    SELECT DepartmentID
    FROM Empleados
    GROUP BY DepartmentID
    HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d
  ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2;
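As a sanity check of what this query returns (not of how Spark executes it), the same SQL can be run in SQLite, which has supported these window functions since version 3.25. The sample data below is invented for illustration; note that department 30 has a salary tie at rank 1, so it produces no rank-2 row:

```python
import sqlite3

# Hypothetical sample data. SQLite verifies the query's semantics only;
# it says nothing about Spark's physical plan.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Empleados (
    EmployeeID INTEGER, Name TEXT, DepartmentID INTEGER, Salary INTEGER)""")
conn.executemany(
    "INSERT INTO Empleados VALUES (?, ?, ?, ?)",
    [(1, "Ana", 10, 90000), (2, "Bruno", 10, 80000),
     (3, "Clara", 20, 70000),                        # lone employee: excluded
     (4, "Diego", 30, 60000), (5, "Eva", 30, 60000)])  # tie at rank 1

rows = conn.execute("""
WITH RankedSalaries AS (
    SELECT DepartmentID, Name, Salary,
           RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS ranking
    FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
    SELECT DepartmentID
    FROM Empleados
    GROUP BY DepartmentID
    HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2
""").fetchall()
print(rows)  # [(10, 'Bruno', 80000)]
```

Only Bruno qualifies: department 20 has a single employee, and department 30's tie means no row ever gets rank 2.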

What Happens Physically?

Spark builds two logical branches:

  1. Branch A → Window function (RANK)
  2. Branch B → HashAggregate (COUNT per department)

Then it performs a join.

Observed Physical Characteristics

  • Two scans of the base dataset
  • Two shuffle stages (one per branch)
  • A join (initially SortMergeJoin, optimized by AQE to BroadcastHashJoin)
  • More stages overall

Even if AQE optimizes the join strategy, the structural cost remains higher.

Approach 2: Window-Only Strategy

WITH ResumenEmpleados AS (
    SELECT
        DepartmentID,
        Salary,
        Name,
        COUNT(EmployeeID) OVER (
            PARTITION BY DepartmentID
        ) AS cant_empleado,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS salario_ranking
    FROM Empleados
)
SELECT
    DepartmentID,
    Salary,
    Name
FROM ResumenEmpleados
WHERE cant_empleado > 1
  AND salario_ranking = 2;
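Running the window-only formulation against the same hypothetical SQLite data confirms that both approaches return identical rows; again, this checks semantics only, not the Spark plan:

```python
import sqlite3

# Same hypothetical sample data as for Approach 1.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Empleados (
    EmployeeID INTEGER, Name TEXT, DepartmentID INTEGER, Salary INTEGER)""")
conn.executemany(
    "INSERT INTO Empleados VALUES (?, ?, ?, ?)",
    [(1, "Ana", 10, 90000), (2, "Bruno", 10, 80000),
     (3, "Clara", 20, 70000),
     (4, "Diego", 30, 60000), (5, "Eva", 30, 60000)])

rows = conn.execute("""
WITH ResumenEmpleados AS (
    SELECT DepartmentID, Salary, Name,
           COUNT(EmployeeID) OVER (PARTITION BY DepartmentID) AS cant_empleado,
           RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS salario_ranking
    FROM Empleados
)
SELECT DepartmentID, Salary, Name
FROM ResumenEmpleados
WHERE cant_empleado > 1 AND salario_ranking = 2
""").fetchall()
print(rows)  # [(10, 80000, 'Bruno')]
```

Same single row as Approach 1, computed from one pass over the table.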

What Happens Physically?

Spark:

  • Performs one scan
  • Executes one shuffle by DepartmentID
  • Applies window functions
  • Filters results

There is no separate aggregate branch and no join.

Observed Physical Characteristics

  • One scan
  • One shuffle
  • WindowGroupLimit optimization for Top-K per partition
  • No join

Why Approach 2 Is Generally More Efficient

Performance in Spark is dominated by:

  1. Shuffles
  2. Distributed joins
  3. Sort operations

Approach 2:

  • Eliminates one shuffle
  • Eliminates the join entirely
  • Keeps computation in a single pipeline

In distributed systems, avoiding joins is often more important than reducing SQL verbosity.


Conceptual Lesson

This comparison highlights a deeper principle:

Window functions enrich rows.
GROUP BY reduces rows.
JOIN reconciles datasets.

If your problem is analytical and row-level context must be preserved, window functions are often the most direct and efficient tool.
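The enrich-vs-reduce distinction can be illustrated in plain Python with toy data (no Spark involved):

```python
from collections import Counter

# Toy data: (department, salary). Invented for illustration.
salaries = [("A", 90), ("A", 80), ("B", 70)]

# GROUP BY reduces rows: one output row per department.
counts = Counter(dept for dept, _ in salaries)  # Counter({'A': 2, 'B': 1})

# A window function enriches rows: every input row survives,
# with per-partition context attached to it.
enriched = [(dept, sal, counts[dept]) for dept, sal in salaries]
# [('A', 90, 2), ('A', 80, 2), ('B', 70, 1)]
```

Because the window keeps every row, filters like `salario_ranking = 2` can be applied directly, with no join back to the original rows.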


Important Note About RANK

RANK() can produce multiple rows for rank = 2 if salaries are tied.

If exactly one employee per department is required, use:

ROW_NUMBER() OVER (
    PARTITION BY DepartmentID
    ORDER BY Salary DESC, EmployeeID ASC
)

This ensures deterministic selection.
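The difference is easy to demonstrate in SQLite with a hypothetical department where two employees tie for second place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Empleados (
    EmployeeID INTEGER, Name TEXT, DepartmentID INTEGER, Salary INTEGER)""")
# Bruno and Carla tie at 80000.
conn.executemany(
    "INSERT INTO Empleados VALUES (?, ?, ?, ?)",
    [(1, "Ana", 10, 90000), (2, "Bruno", 10, 80000), (3, "Carla", 10, 80000)])

# RANK: both tied rows share rank 2 -> two rows returned.
rank_rows = conn.execute("""
    SELECT Name FROM (
        SELECT Name,
               RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS r
        FROM Empleados)
    WHERE r = 2 ORDER BY Name""").fetchall()

# ROW_NUMBER with a tiebreaker -> exactly one row, deterministically.
rn_rows = conn.execute("""
    SELECT Name FROM (
        SELECT Name,
               ROW_NUMBER() OVER (
                   PARTITION BY DepartmentID
                   ORDER BY Salary DESC, EmployeeID ASC) AS rn
        FROM Empleados)
    WHERE rn = 2""").fetchall()

print(rank_rows)  # [('Bruno',), ('Carla',)]
print(rn_rows)    # [('Bruno',)]
```

With `ROW_NUMBER`, the `EmployeeID ASC` tiebreaker guarantees the same row is chosen on every run.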


How to Inspect This Yourself

Use:

df.explain("formatted")

Or open Spark UI:

http://localhost:4040

Navigate to:

Jobs → Stages → DAG Visualization

Focus on:

  • Number of Exchange operators
  • Join strategies
  • Number of stages

Key Takeaway

The shorter SQL query is not necessarily faster.
The plan with fewer distributed operations usually is.

In this case, the window-only approach wins because it avoids structural duplication of work.
