Learn Spark Internals Like an SME: DAG & Execution

If you want to become a true PySpark SME, you need to go beyond writing transformations—you must understand how Spark actually executes them under the hood.

In this post, we’ll break down Spark’s execution model step by step:

Lazy Evaluation
DAG Construction
Job → Stage → Task hierarchy
Shuffles & Transformations
Execution Plans (Catalyst)
Real-world performance insights

1️⃣ Lazy Evaluation

Spark does not execute immediately, it :

Records operations.
Builds execution plan.
Execute only when an action is called.

Example Code:

df.filter().groupBy().count()

What is Lazy Evaluation?

Allows full pipeline optimization.
Avoids unnecessary computations.

SME Insight:

Spark optimizes entire pipeline, not individual steps.

2️⃣ DAG (Directed Acyclic Graph)

DAG = Graph representing Transformations.

Nodes → operations (filter, join).
Edges → dependencies.

Properties:

Directed → execution order.
Acyclic → no loops.
Graph → chain of operations.

Why DAG is important?

Enables optimization (Catalyst).
Enables parallel processing.
Enables fault tolerance (lineage-based computation).

SME Insight:

Spark creates DAG → optimizes → converts to execution plan → runs

3️⃣ Execution Hierarchy

💡

Action → Job → Stage → Task

Job:

Triggered by an action.
Each Action = New Job.
Actions = count, show, collect, write, etc.

Stage:

Group of operations.
Splits occur at shuffle boundary.

Task:

Smallest unit of execution.
1 Partition = 1 Task.

SME Insight:

Task = actual parallel execution.

Stage completes only when all the tasks with in are completed.

4️⃣Narrow vs Wide Transformations

Narrow Transformation

Data stays within the same partition.

Properties:

No shuffle.
Same stage.
Fast.

Examples: filter, map, select, etc.

Wide Transformation

Data moves across partitions (shuffle).

Properties:

Shuffle occurs.
New stage created.
Expensive

Examples: groupBy, join, distinct, etc.

SME Insight:

Wide Transformation = Main performance bottleneck

5️⃣Shuffle

Shuffle = Data redistribution across executors.

When it happens?

groupBy
join
sort

Internal Steps:

Write intermediate data in disk.
Transfer data across network.
Read data in the next stage.

Causes high I/O + network cost

SME Insight:

Shuffle = most expensive operation.

Every shuffle = new stage.

6️⃣ Execution Plan

What is it?

Blueprint of execution

df.explain('true')

Types of Plans:

Parsed Logical Plan → Raw User query.
Analyzed Logical Plan → Column resolution + types.
Optimized Logical Plan → Optimizations applied.
Physical Plan → Actual execution strategy.

7️⃣ How to Read Physical Plan

Important Operations:

Scan (Data Reading) → Scan Existing RDD
Filter (Filtering rows) → filter(age > 30)
HashAggregate (groupBy operation) → HashAggregate
Exchange (Indicates Shuffle) → Exchange HashPartitioning(...)

SME Insight:

Exchange = performance cost point.

8️⃣ Example Flow (Understanding Execution)

Code:

df.filter(df.amount > 200)\
  .groupBy('category') \
  .count()

Execution Flow:

Filter (Narrow Transformation).
groupBy (Wide Transformation).
DAG split into stages.
Execution begins after action (count)

Observations:

Step	Observations
Filter	Same Stage
Group By	New Stage
Exchange	Shuffle
HashAggregate	Aggregation

9️⃣Common Mistakes

Writing PySpark without checking plan. ❌
Always analyze .explain() ✅

🔟Optimizations on partitions

Partitioning Sizing:

Target = 100-300 MB per partition.

Formula:

➕

Partitions = Data Size / 200 MB

Best Practice:

spark.conf.set('spark.sql.adaptive.enabled', 'true')
spark.conf.set('spark.sql.adaptive.coalesceParitions.enabled', 'true')

⏸️SME Thinking Approach

Always Think:

Will this cause Shuffle?
How many Stages?
How many Tasks?
Where is the performance bottleneck?

🚀 Final Thoughts

If you take away just one thing from this post, let it be this:

Spark is not just executing your code — it is building, optimizing, and orchestrating a distributed system behind the scenes.

Every transformation you write becomes part of a larger execution plan.
Every shuffle, every stage split, every task — directly impacts performance.

Once you start thinking in terms of:

DAGs instead of code
Stages instead of steps
Shuffles instead of operations

You stop being just a PySpark user…
and start thinking like a distributed systems engineer.

🧠 What You Should Think Going Forward

Before writing any PySpark pipeline, ask:

Will this cause a shuffle?
How many stages will this create?
Can I reduce partition movement?
Is this execution plan optimal?

🔥 What’s Next

In the next blog, we’ll go one layer deeper:

👉 Spark Architecture

Driver vs Executor
Cluster behavior
Memory & resource execution

⚡ Because writing PySpark is easy.
Writing scalable PySpark is what makes you an SME.

Command Palette

1️⃣ Lazy Evaluation

What is Lazy Evaluation?

SME Insight:

2️⃣ DAG (Directed Acyclic Graph)

Properties:

Why DAG is important?

SME Insight:

3️⃣ Execution Hierarchy

Job:

Stage:

Task:

SME Insight:

4️⃣Narrow vs Wide Transformations

Narrow Transformation

Wide Transformation

SME Insight:

5️⃣Shuffle

When it happens?

Internal Steps:

SME Insight:

6️⃣ Execution Plan

What is it?

Types of Plans:

7️⃣ How to Read Physical Plan

Important Operations:

SME Insight:

8️⃣ Example Flow (Understanding Execution)

Code:

Execution Flow:

Observations:

9️⃣Common Mistakes

🔟Optimizations on partitions

Partitioning Sizing:

Formula:

Best Practice:

⏸️SME Thinking Approach

🚀 Final Thoughts

🧠 What You Should Think Going Forward

🔥 What’s Next

Comments

Data Engineering

80 Data Engineering Interview Questions: SQL & Python Guide [2026]

More from this blog