Window Functions vs JOIN in Spark: A Physical Plan Perspective
Analytical queries in Spark SQL often admit several correct formulations. Correct, however, does not mean equivalent: different formulations can compile to very different execution plans.
This article compares two approaches to the same problem:
“Find the second highest salary per department, but only in departments with at least two employees.”
We analyze which approach is more efficient and why, from a physical plan standpoint.
Problem Statement
Given table:
Empleados(EmployeeID, Name, DepartmentID, Salary, JoinDate)
Goal:
- Second highest salary per department
- Only departments with ≥ 2 employees
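Before comparing plans, it helps to pin down the expected result. The sketch below is a plain-Python reference implementation of the requirement, using made-up sample data; "second highest" follows RANK() semantics, i.e. a row qualifies when exactly one salary in its department is strictly greater.

```python
from collections import defaultdict

# Hypothetical sample rows: (EmployeeID, Name, DepartmentID, Salary)
empleados = [
    (1, "Ana",   10, 90000),
    (2, "Luis",  10, 80000),
    (3, "Marta", 10, 80000),  # tied with Luis
    (4, "Pedro", 20, 70000),  # department 20 has a single employee
    (5, "Sara",  30, 60000),
    (6, "Tomas", 30, 50000),
]

def second_highest_per_department(rows):
    """(DepartmentID, Name, Salary) for every row whose salary has RANK = 2
    in its department, restricted to departments with >= 2 employees."""
    by_dept = defaultdict(list)
    for _, name, dept, salary in rows:
        by_dept[dept].append((name, salary))
    result = []
    for dept in sorted(by_dept):
        members = by_dept[dept]
        if len(members) < 2:
            continue  # mirrors HAVING COUNT(*) >= 2
        for name, salary in members:
            # RANK() semantics: 1 + number of strictly higher salaries
            rank = 1 + sum(1 for _, s in members if s > salary)
            if rank == 2:
                result.append((dept, name, salary))
    return result

print(second_highest_per_department(empleados))
# [(10, 'Luis', 80000), (10, 'Marta', 80000), (30, 'Tomas', 50000)]
```

Note the tie: both Luis and Marta hold rank 2, a subtlety the RANK discussion later in the article revisits.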
Approach 1: CTE + GROUP BY + JOIN
WITH RankedSalaries AS (
    SELECT
        DepartmentID,
        Name,
        Salary,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS ranking
    FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
    SELECT DepartmentID
    FROM Empleados
    GROUP BY DepartmentID
    HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d
    ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2;
What Happens Physically?
Spark builds two logical branches:
- Branch A → Window function (RANK)
- Branch B → HashAggregate (COUNT per department)
Then it performs a join.
Observed Physical Characteristics
- Two scans of the base dataset
- Two shuffle stages (one per branch)
- A join (initially SortMergeJoin, optimized by AQE to BroadcastHashJoin)
- More stages overall
Even when AQE improves the join strategy, the structural cost (two scans and two shuffles) remains.
Approach 2: Window-Only Strategy
WITH ResumenEmpleados AS (
    SELECT
        DepartmentID,
        Salary,
        Name,
        COUNT(EmployeeID) OVER (
            PARTITION BY DepartmentID
        ) AS cant_empleado,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS salario_ranking
    FROM Empleados
)
SELECT
    DepartmentID,
    Salary,
    Name
FROM ResumenEmpleados
WHERE cant_empleado > 1
  AND salario_ranking = 2;
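The two formulations are logically equivalent, and you can verify that outside Spark. The sketch below runs both queries against an in-memory SQLite database (window functions require SQLite 3.25+); the engine and sample data are illustrative assumptions, not part of the original setup, and both queries project columns in the same order so the results can be compared directly.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Spark SQL here, since both
# implement standard RANK() / COUNT() OVER semantics for this query shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Empleados (EmployeeID INT, Name TEXT, DepartmentID INT, Salary INT);
INSERT INTO Empleados VALUES
  (1, 'Ana',   10, 90000),
  (2, 'Luis',  10, 80000),
  (3, 'Pedro', 20, 70000),
  (4, 'Sara',  30, 60000),
  (5, 'Tomas', 30, 50000);
""")

join_query = """
WITH RankedSalaries AS (
    SELECT DepartmentID, Name, Salary,
           RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS ranking
    FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
    SELECT DepartmentID FROM Empleados GROUP BY DepartmentID HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2
ORDER BY r.DepartmentID;
"""

window_query = """
WITH ResumenEmpleados AS (
    SELECT DepartmentID, Name, Salary,
           COUNT(EmployeeID) OVER (PARTITION BY DepartmentID) AS cant_empleado,
           RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS salario_ranking
    FROM Empleados
)
SELECT DepartmentID, Name, Salary
FROM ResumenEmpleados
WHERE cant_empleado > 1 AND salario_ranking = 2
ORDER BY DepartmentID;
"""

a = conn.execute(join_query).fetchall()
b = conn.execute(window_query).fetchall()
print(a)  # [(10, 'Luis', 80000), (30, 'Tomas', 50000)]
assert a == b  # same rows from both formulations
```

Same answer, very different plans, which is exactly the point of the comparison below.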
What Happens Physically?
Spark:
- Performs one scan
- Executes one shuffle by DepartmentID
- Applies window functions
- Filters results
There is no separate aggregate branch and no join.
Observed Physical Characteristics
- One scan
- One shuffle
- WindowGroupLimit optimization (Spark 3.5+) for Top-K per partition
- No join
Why Approach 2 Is Generally More Efficient
Performance in Spark is dominated by:
- Shuffles
- Distributed joins
- Sort operations
Approach 2:
- Eliminates one shuffle
- Eliminates the join entirely
- Keeps computation in a single pipeline
In distributed systems, avoiding joins is often more important than reducing SQL verbosity.
Conceptual Lesson
This comparison highlights a deeper principle:
Window functions enrich rows.
GROUP BY reduces rows.
JOIN reconciles datasets.
If your problem is analytical and row-level context must be preserved, window functions are often the most direct and efficient tool.
Important Note About RANK
RANK() can produce multiple rows for rank = 2 if salaries are tied.
If exactly one employee per department is required, use:
ROW_NUMBER() OVER (
    PARTITION BY DepartmentID
    ORDER BY Salary DESC, EmployeeID ASC
)
This ensures deterministic selection.
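The difference is easy to see on tied data. The sketch below is illustrative and uses sqlite3 (whose RANK/ROW_NUMBER follow the same standard semantics as Spark SQL's) with a made-up salary tie:

```python
import sqlite3

# Hypothetical data: Luis and Marta are tied on salary within department 10.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Empleados (EmployeeID INT, Name TEXT, DepartmentID INT, Salary INT);
INSERT INTO Empleados VALUES
  (1, 'Ana',   10, 90000),
  (2, 'Luis',  10, 80000),
  (3, 'Marta', 10, 80000);
""")

rows = conn.execute("""
SELECT Name,
       RANK()       OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS rnk,
       ROW_NUMBER() OVER (PARTITION BY DepartmentID
                          ORDER BY Salary DESC, EmployeeID ASC) AS rn
FROM Empleados
ORDER BY EmployeeID;
""").fetchall()

print(rows)  # [('Ana', 1, 1), ('Luis', 2, 2), ('Marta', 2, 3)]

# RANK returns two rows at rank 2; ROW_NUMBER picks exactly one of them.
assert [r[0] for r in rows if r[1] == 2] == ['Luis', 'Marta']
assert [r[0] for r in rows if r[2] == 2] == ['Luis']
```

Filtering on rn = 2 therefore always yields one row per department, with EmployeeID as the deterministic tiebreaker.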
How to Inspect This Yourself
Use:
df.explain("formatted")
(or EXPLAIN FORMATTED <query> in Spark SQL)
Or open Spark UI:
http://localhost:4040
Navigate to:
Jobs → Stages → DAG Visualization
Focus on:
- Number of Exchange operators
- Join strategies
- Number of stages
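These checks can also be scripted. The sketch below counts the cost-dominating operators in a formatted plan string. The operator names (Exchange, Scan, the *Join family) are real Spark physical operators, but the two plan excerpts are hand-written illustrations in the style of df.explain("formatted") output, not captures from a real run.

```python
import re

# Hand-written plan excerpt mimicking Approach 1 (two branches + join).
plan_join = """
BroadcastHashJoin [DepartmentID], [DepartmentID], Inner
:- Window [rank(Salary)], [DepartmentID]
:  +- Exchange hashpartitioning(DepartmentID, 200)
:     +- Scan parquet Empleados
+- BroadcastExchange
   +- HashAggregate(keys=[DepartmentID], functions=[count(1)])
      +- Exchange hashpartitioning(DepartmentID, 200)
         +- Scan parquet Empleados
"""

# Hand-written plan excerpt mimicking Approach 2 (single pipeline).
plan_window = """
Filter ((cant_empleado > 1) AND (salario_ranking = 2))
+- Window [count(EmployeeID), rank(Salary)], [DepartmentID]
   +- Exchange hashpartitioning(DepartmentID, 200)
      +- Scan parquet Empleados
"""

def plan_stats(plan: str) -> dict:
    """Count the operators that dominate distributed cost."""
    return {
        "shuffles": len(re.findall(r"\bExchange\b", plan)),  # hash-partition shuffles
        "scans": plan.count("Scan parquet"),
        "joins": len(re.findall(r"\w+Join", plan)),
    }

print(plan_stats(plan_join))    # {'shuffles': 2, 'scans': 2, 'joins': 1}
print(plan_stats(plan_window))  # {'shuffles': 1, 'scans': 1, 'joins': 0}
```

Feeding it a real plan string from df.explain("formatted") gives a quick structural comparison before digging into the Spark UI.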
Key Takeaway
The shorter SQL query is not necessarily faster.
The plan with fewer distributed operations usually is.
In this case, the window-only approach wins because it avoids structural duplication of work.