Spark Simplified: Architecture for Data Enthusiasts

From DAGs to Executors: Understanding Spark Step by Step


I'm Varchasv, a Data Engineer working on enterprise data integration.

Currently on a 90-problem challenge to level up my technical skills and switch to a more development-focused role.

What I'm doing:

  • Solving 2 Leetcode problems daily (SQL + DSA).
  • Blogging about each problem.
  • Building in public.

My Goal - Land a better data engineering role by mid-2026.

Follow my journey !!

Apache Spark has become a cornerstone of big data processing, offering speed, scalability, and flexibility. Whether you're just starting out or looking to deepen your understanding, this guide will walk you through the core architecture of Spark in a clear and approachable way.

What is Apache Spark?

Apache Spark is a distributed computing system designed for fast, flexible, large-scale data processing. It supports Python, Java, R, and Scala, and provides high-level APIs for working with structured and unstructured data.

Spark Architecture

Core Components

  1. Driver Program [Brain]

  • The coordinating process of a Spark application; it controls the job from start to finish.

  • Creates SparkContext.

  • Converts user code into a DAG of stages, then into tasks.

  • Sends tasks to executors.

  • Tracks and aggregates results.

  2. Cluster Manager [Machines]

  • Allocates resources to your Spark applications.

  • Types → Standalone, YARN, Kubernetes (K8s), and Mesos (deprecated since Spark 3.2).

  3. Executors [Transformations and Actions]

  • Run on worker nodes.

  • Each executor has its own JVM.

  • Store cached data.

  • Return results to the driver.
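
To make the division of labour concrete, here is a minimal pure-Python sketch of the driver and executor roles described above. It is a toy model, not Spark code: the function names (split_into_partitions, run_task, run_job) are made up for illustration, and a thread pool stands in for executors, which in real Spark are separate JVM processes on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the input into roughly equal partitions."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: process one partition and return a partial result."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3):
    """Driver side: create one task per partition, send them out, aggregate."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)  # the driver combines the partial results


total = run_job(list(range(10)))
print(total)  # sum of squares of 0..9
```

The driver never touches individual rows here; it only plans the work, hands out tasks, and combines what comes back, which is the same split of responsibilities the list above describes.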

Internal Concepts

  1. DAGs (Directed Acyclic Graphs) [Plan of Action]

  • Spark builds a DAG of transformations instead of executing line by line.

  • Execution is only triggered by actions (such as .collect() or .show()); each action launches a job, which Spark splits into stages.

  • Helps with lazy evaluation and optimization.

  2. Catalyst Optimizer

  • Optimizes logical and physical query plans.

  • Applies rules to reorder filters, push down predicates, and optimize joins.

  3. Tungsten Engine [Execution Engine]

  • Uses off-heap memory, code generation, and a compact binary data format for speed.
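
Lazy evaluation is easier to see in a small example. The sketch below is a toy model in plain Python, not Spark's implementation: LazyDataset and its methods are hypothetical names chosen to mirror the idea that transformations only record a plan, and that an action such as collect() is what actually runs it.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: transformations only record a plan."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of (operation name, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset with one more planned step.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also just extends the plan.
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def explain(self):
        """Describe the recorded plan without running it."""
        return " -> ".join(["source"] + [name for name, _ in self._plan])

    def collect(self):
        """Action: only now is the plan executed against the data."""
        rows = iter(self._source)
        for name, fn in self._plan:
            rows = map(fn, rows) if name == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).filter(lambda x: x % 2 == 0).map(double)
print(ds.explain())  # the plan exists...
print(len(calls))    # ...but double() has not been called yet
result = ds.collect()
print(result)
```

Because the whole plan is known before anything runs, an optimizer has the chance to rearrange it first. That is the opening Catalyst uses in real Spark, although its rules are far more involved than anything shown here.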

PySpark Job Execution: step-by-step

  1. Driver runs your script and builds a DAG.

  2. The plan is optimized by the Catalyst optimizer (for DataFrame and SQL code).

  3. The Tungsten engine compiles the physical plan to JVM bytecode.

  4. Spark groups each run of consecutive narrow transformations into one stage; every wide transformation (one that needs a shuffle) starts a new stage.

  5. Tasks (per partition) are scheduled and sent to executors.

  6. Executors process the data and return the results to the driver.
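
Steps 4 and 5 can be sketched in a few lines of plain Python. This is a simplified illustration, not Spark's scheduler: the WIDE set lists a handful of RDD operations that normally require a shuffle, the helper names are hypothetical, and the task count assumes the number of partitions stays the same in every stage, which real jobs do not guarantee.

```python
# A few RDD operations that normally require a shuffle (wide transformations).
# Assumption for this sketch: everything else is treated as narrow.
WIDE = {"groupByKey", "reduceByKey", "join", "repartition", "distinct"}


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at each wide operation."""
    stages = []
    current = []
    for op in operations:
        if op in WIDE and current:
            stages.append(current)  # shuffle boundary: close the current stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def count_tasks(stages, num_partitions):
    """One task per partition per stage (assumes a fixed partition count)."""
    return len(stages) * num_partitions


pipeline = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(pipeline)
for number, stage in enumerate(stages):
    print(f"stage {number}: {stage}")
print("tasks:", count_tasks(stages, num_partitions=4))
```

The Spark UI shows the same breakdown for a real job: each shuffle appears as a boundary between stages, and each stage lists one task per partition.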

Wrapping Up

Apache Spark might seem intimidating at first, but once you break it down into its core parts—drivers, executors, DAGs, and optimizers—it starts to make a lot more sense. Hopefully, this guide gave you a clearer picture of how Spark works behind the scenes.

If you're just getting started, don’t worry about mastering everything at once. Try running a few simple jobs, explore the Spark UI, and see how the pieces fit together. The more you experiment, the more intuitive it becomes.

Thanks for reading! If you found this helpful or have questions, feel free to drop a comment or reach out. I’d love to hear how you’re using Spark or what you'd like to learn next.
