Skip to main content

Command Palette

Search for a command to run...

Day 1 : Apache Spark Internals

Understanding how spark works

Updated
5 min read
V

I'm Varchasv, a Data Engineer working on enterprise data integration.

Currently on a 90-problem challenge to level up my technical skills and switch to a more development-focused role.

What I'm doing:

  • Solving 2 Leetcode problems daily (SQL + DSA).
  • Blogging about each problem.
  • Building in public.

My Goal - Land a better data engineering role by mid-2026.

Follow my journey !!

If you want to become a true PySpark SME, you need to go beyond writing transformations—you must understand how Spark actually executes them under the hood.

In this post, we’ll break down Spark’s execution model step by step:

  • Lazy Evaluation

  • DAG Construction

  • Job → Stage → Task hierarchy

  • Shuffles & Transformations

  • Execution Plans (Catalyst)

  • Real-world performance insights


1️⃣ Lazy Evaluation

Spark does not execute immediately, it :

  1. Records operations.

  2. Builds execution plan.

  3. Execute only when an action is called.

Example Code:

df.filter().groupBy().count()

What is Lazy Evaluation?

  • Allows full pipeline optimization.

  • Avoids unnecessary computations.

SME Insight:

Spark optimizes entire pipeline, not individual steps.


2️⃣ DAG (Directed Acyclic Graph)

DAG = Graph representing Transformations.

  • Nodes → operations (filter, join).

  • Edges → dependencies.

Properties:

  • Directed → execution order.

  • Acyclic → no loops.

  • Graph → chain of operations.

Why DAG is important?

  • Enables optimization (Catalyst).

  • Enables parallel processing.

  • Enables fault tolerance (lineage-based computation).

SME Insight:

Spark creates DAG → optimizes → converts to execution plan → runs


3️⃣ Execution Hierarchy

💡
Action → Job → Stage → Task

Job:

  • Triggered by an action.

  • Each Action = New Job.

  • Actions = count, show, collect, write, etc.

Stage:

  • Group of operations.

  • Splits occur at shuffle boundary.

Task:

  • Smallest unit of execution.

  • 1 Partition = 1 Task.

SME Insight:

  • Task = actual parallel execution.

  • Stage completes only when all the tasks with in are completed.


4️⃣Narrow vs Wide Transformations

Narrow Transformation

Data stays within the same partition.

Properties:

  • No shuffle.

  • Same stage.

  • Fast.

Examples: filter, map, select, etc.

Wide Transformation

Data moves across partitions (shuffle).

Properties:

  • Shuffle occurs.

  • New stage created.

  • Expensive

Examples: groupBy, join, distinct, etc.

SME Insight:

Wide Transformation = Main performance bottleneck


5️⃣Shuffle

Shuffle = Data redistribution across executors.

When it happens?

  • groupBy

  • join

  • sort

Internal Steps:

  1. Write intermediate data in disk.

  2. Transfer data across network.

  3. Read data in the next stage.

Causes high I/O + network cost

SME Insight:

  • Shuffle = most expensive operation.

  • Every shuffle = new stage.


6️⃣ Execution Plan

What is it?

Blueprint of execution

df.explain('true')

Types of Plans:

  1. Parsed Logical Plan → Raw User query.

  2. Analyzed Logical Plan → Column resolution + types.

  3. Optimized Logical Plan → Optimizations applied.

  4. Physical Plan → Actual execution strategy.


7️⃣ How to Read Physical Plan

Important Operations:

  1. Scan (Data Reading) → Scan Existing RDD

  2. Filter (Filtering rows) → filter(age > 30)

  3. HashAggregate (groupBy operation) → HashAggregate

  4. Exchange (Indicates Shuffle) → Exchange HashPartitioning(...)

SME Insight:

Exchange = performance cost point.


8️⃣ Example Flow (Understanding Execution)

Code:

df.filter(df.amount > 200)\
  .groupBy('category') \
  .count()

Execution Flow:

  1. Filter (Narrow Transformation).

  2. groupBy (Wide Transformation).

  3. DAG split into stages.

  4. Execution begins after action (count)

Observations:

Step Observations
Filter Same Stage
Group By New Stage
Exchange Shuffle
HashAggregate Aggregation

9️⃣Common Mistakes

  • Writing PySpark without checking plan. ❌

  • Always analyze .explain()


🔟Optimizations on partitions

Partitioning Sizing:

Target = 100-300 MB per partition.

Formula:

Partitions = Data Size / 200 MB

Best Practice:

spark.conf.set('spark.sql.adaptive.enabled', 'true')
spark.conf.set('spark.sql.adaptive.coalesceParitions.enabled', 'true')

⏸️SME Thinking Approach

Always Think:

  1. Will this cause Shuffle?

  2. How many Stages?

  3. How many Tasks?

  4. Where is the performance bottleneck?


🚀 Final Thoughts

If you take away just one thing from this post, let it be this:

Spark is not just executing your code — it is building, optimizing, and orchestrating a distributed system behind the scenes.

Every transformation you write becomes part of a larger execution plan.
Every shuffle, every stage split, every task — directly impacts performance.

Once you start thinking in terms of:

  • DAGs instead of code

  • Stages instead of steps

  • Shuffles instead of operations

You stop being just a PySpark user…
and start thinking like a distributed systems engineer.


🧠 What You Should Think Going Forward

Before writing any PySpark pipeline, ask:

  • Will this cause a shuffle?

  • How many stages will this create?

  • Can I reduce partition movement?

  • Is this execution plan optimal?


🔥 What’s Next

In the next blog, we’ll go one layer deeper:

👉 Spark Architecture

  • Driver vs Executor

  • Cluster behavior

  • Memory & resource execution


⚡ Because writing PySpark is easy.
Writing scalable PySpark is what makes you an SME.

Data Engineering

Part 1 of 3

In this series, I will tell you all about different tools that a data engineer uses and how to master them.

Up next

80 Data Engineering Interview Questions: SQL & Python Guide [2026]

Master SQL and Python with These Essential Data Engineering Interview Questions [2026]