Day 1 : Apache Spark Internals
Understanding how spark works
I'm Varchasv, a Data Engineer working on enterprise data integration.
Currently on a 90-problem challenge to level up my technical skills and switch to a more development-focused role.
What I'm doing:
- Solving 2 Leetcode problems daily (SQL + DSA).
- Blogging about each problem.
- Building in public.
My Goal - Land a better data engineering role by mid-2026.
Follow my journey !!
If you want to become a true PySpark SME, you need to go beyond writing transformations—you must understand how Spark actually executes them under the hood.
In this post, we’ll break down Spark’s execution model step by step:
Lazy Evaluation
DAG Construction
Job → Stage → Task hierarchy
Shuffles & Transformations
Execution Plans (Catalyst)
Real-world performance insights
1️⃣ Lazy Evaluation
Spark does not execute immediately, it :
Records operations.
Builds execution plan.
Execute only when an action is called.
Example Code:
df.filter().groupBy().count()
What is Lazy Evaluation?
Allows full pipeline optimization.
Avoids unnecessary computations.
SME Insight:
Spark optimizes entire pipeline, not individual steps.
2️⃣ DAG (Directed Acyclic Graph)
DAG = Graph representing Transformations.
Nodes → operations (filter, join).
Edges → dependencies.
Properties:
Directed → execution order.
Acyclic → no loops.
Graph → chain of operations.
Why DAG is important?
Enables optimization (Catalyst).
Enables parallel processing.
Enables fault tolerance (lineage-based computation).
SME Insight:
Spark creates DAG → optimizes → converts to execution plan → runs
3️⃣ Execution Hierarchy
Job:
Triggered by an action.
Each Action = New Job.
Actions = count, show, collect, write, etc.
Stage:
Group of operations.
Splits occur at shuffle boundary.
Task:
Smallest unit of execution.
1 Partition = 1 Task.
SME Insight:
Task = actual parallel execution.
Stage completes only when all the tasks with in are completed.
4️⃣Narrow vs Wide Transformations
Narrow Transformation
Data stays within the same partition.
Properties:
No shuffle.
Same stage.
Fast.
Examples: filter, map, select, etc.
Wide Transformation
Data moves across partitions (shuffle).
Properties:
Shuffle occurs.
New stage created.
Expensive
Examples: groupBy, join, distinct, etc.
SME Insight:
Wide Transformation = Main performance bottleneck
5️⃣Shuffle
Shuffle = Data redistribution across executors.
When it happens?
groupBy
join
sort
Internal Steps:
Write intermediate data in disk.
Transfer data across network.
Read data in the next stage.
Causes high I/O + network cost
SME Insight:
Shuffle = most expensive operation.
Every shuffle = new stage.
6️⃣ Execution Plan
What is it?
Blueprint of execution
df.explain('true')
Types of Plans:
Parsed Logical Plan → Raw User query.
Analyzed Logical Plan → Column resolution + types.
Optimized Logical Plan → Optimizations applied.
Physical Plan → Actual execution strategy.
7️⃣ How to Read Physical Plan
Important Operations:
Scan (Data Reading) →
Scan Existing RDDFilter (Filtering rows) →
filter(age > 30)HashAggregate (groupBy operation) →
HashAggregateExchange (Indicates Shuffle) →
Exchange HashPartitioning(...)
SME Insight:
Exchange = performance cost point.
8️⃣ Example Flow (Understanding Execution)
Code:
df.filter(df.amount > 200)\
.groupBy('category') \
.count()
Execution Flow:
Filter (Narrow Transformation).
groupBy (Wide Transformation).
DAG split into stages.
Execution begins after action (count)
Observations:
| Step | Observations |
|---|---|
| Filter | Same Stage |
| Group By | New Stage |
| Exchange | Shuffle |
| HashAggregate | Aggregation |
9️⃣Common Mistakes
Writing PySpark without checking plan. ❌
Always analyze
.explain()✅
🔟Optimizations on partitions
Partitioning Sizing:
Target = 100-300 MB per partition.
Formula:
Best Practice:
spark.conf.set('spark.sql.adaptive.enabled', 'true')
spark.conf.set('spark.sql.adaptive.coalesceParitions.enabled', 'true')
⏸️SME Thinking Approach
Always Think:
Will this cause Shuffle?
How many Stages?
How many Tasks?
Where is the performance bottleneck?
🚀 Final Thoughts
If you take away just one thing from this post, let it be this:
Spark is not just executing your code — it is building, optimizing, and orchestrating a distributed system behind the scenes.
Every transformation you write becomes part of a larger execution plan.
Every shuffle, every stage split, every task — directly impacts performance.
Once you start thinking in terms of:
DAGs instead of code
Stages instead of steps
Shuffles instead of operations
You stop being just a PySpark user…
and start thinking like a distributed systems engineer.
🧠 What You Should Think Going Forward
Before writing any PySpark pipeline, ask:
Will this cause a shuffle?
How many stages will this create?
Can I reduce partition movement?
Is this execution plan optimal?
🔥 What’s Next
In the next blog, we’ll go one layer deeper:
👉 Spark Architecture
Driver vs Executor
Cluster behavior
Memory & resource execution
⚡ Because writing PySpark is easy.
Writing scalable PySpark is what makes you an SME.