Apache Spark on Azure
Introduction to Apache Spark and Its Capabilities in Azure
Apache Spark on the Microsoft Azure platform enables users to harness Spark's powerful distributed computing features for big data processing and analytics. Before exploring the specific Azure services that support Apache Spark, let's understand what Apache Spark is and what its core capabilities are.
What is Apache Spark?
Apache Spark is an open-source, distributed computing engine designed for processing large-scale data sets quickly. It is widely used in big data and analytics applications due to its ability to handle diverse workloads efficiently, including:
Batch processing: Fast computation for large-scale datasets (Spark SQL).
Streaming analytics: Real-time data processing (Spark Streaming).
Graph processing: Complex graph computations for relationship analysis (Spark GraphX).
Machine learning (ML) and AI: Building scalable, data-driven models (Spark MLlib).
Spark’s architecture is optimized for speed, scalability, and versatility, making it an essential tool for modern big data and AI applications.
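As a quick taste of what working with Spark looks like, here is a minimal PySpark sketch of the batch/SQL side of these capabilities; the file path and column names are placeholders rather than real data:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for Spark SQL and DataFrames.
spark = SparkSession.builder.appName("spark-capabilities-demo").getOrCreate()

# Batch processing with Spark SQL: read a file and run a declarative query over it.
# The path and column names below are placeholders for your own data.
sales = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
sales.createOrReplaceTempView("sales")

top_products = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
""")
top_products.show()
```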
Apache Spark is often compared to Apache Hadoop, especially its MapReduce component. The key difference between the two lies in their data processing approach.
Hadoop MapReduce writes intermediate data to disk between each processing step, which can significantly slow down performance, especially for iterative tasks.
Apache Spark, on the other hand, processes data in-memory whenever possible, reducing the need for disk I/O. This approach greatly enhances processing speed and efficiency, particularly for iterative computations such as those used in machine learning algorithms or graph processing.
This in-memory computation is one of Spark's standout features, making it a preferred choice for real-time and interactive analytics over traditional batch processing frameworks like Hadoop MapReduce.
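To make the contrast concrete, here is a small, hedged PySpark sketch of the in-memory pattern: the dataset is generated on the fly, and cache() is what keeps it in memory between iterative passes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# A synthetic dataset; in a real workload this would be read from storage.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "value")

# cache() asks Spark to keep the data in memory after the first action,
# so repeated passes (typical of iterative ML algorithms) skip the disk read.
df.cache()

for i in range(3):
    # Each pass after the first reuses the cached, in-memory partitions.
    total = df.selectExpr("sum(value) AS s").first()["s"]
    print(f"iteration {i}: sum = {total}")

df.unpersist()
```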
How Apache Spark Works
Apache Spark has a hierarchical primary/secondary (master/worker) architecture. The Spark Driver is the primary node: it requests resources from the cluster manager, which manages the secondary (worker) nodes, and it returns the results to the application client.
This architecture depends on two main abstractions:
1. Resilient Distributed Datasets (RDD)
2. Directed Acyclic Graphs (DAG)
Resilient Distributed Datasets (RDDs)
A Resilient Distributed Dataset (RDD) is Spark's fundamental data abstraction: an immutable, partitioned collection of records that can be processed in parallel across a cluster. RDDs achieve fault tolerance through lineage information. Spark keeps track of the series of transformations that led to the creation of each RDD, making it possible to recompute lost partitions if necessary. This mechanism ensures data integrity even in the event of hardware failures or cluster disruptions. Additionally, RDDs are inherently distributed, with their partitions spread across multiple nodes in a cluster, enabling high performance and scalability. There are two main ways to create RDDs. The first is by loading data from external sources such as Hadoop Distributed File System (HDFS), Azure Data Lake Storage Gen2, or Amazon S3; the second is by parallelizing an existing collection in the driver program.
By abstracting complexities like data distribution, fault recovery, and parallelism, RDDs enable developers to focus on implementing transformations and actions to derive insights from data. As the backbone of Spark’s distributed computing capabilities, RDDs play a pivotal role in ensuring reliable, efficient, and scalable data processing.
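A short, hedged sketch of both creation paths and of lineage in practice (the text-file path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Way 1: create an RDD by loading data from external storage.
# The path is a placeholder; it could point to HDFS, ADLS Gen2 (abfss://...), or S3.
lines = sc.textFile("/data/events.txt")

# Way 2: create an RDD by parallelizing a collection that already exists in the driver.
numbers = sc.parallelize(range(10), numSlices=4)

# Transformations return new, immutable RDDs; the originals are left unchanged.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString() shows the lineage Spark would use to recompute lost partitions.
print(evens.toDebugString().decode("utf-8"))
print(evens.collect())
```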
Directed Acyclic Graph (DAG)
Apache Spark's architecture relies on a Directed Acyclic Graph (DAG) to represent the sequence of transformations applied to Resilient Distributed Datasets (RDDs). When a user submits a Spark application, the system constructs a DAG that outlines the various stages and tasks required for execution. This DAG is managed by the DAG Scheduler, which divides the graph into stages that can be executed in parallel wherever possible.
By organizing transformations into a DAG, Spark can optimize the execution plan, ensuring efficient resource utilization and faster processing. The DAG provides a clear overview of dependencies between operations, enabling Spark to streamline task execution, minimize data shuffling, and achieve better performance. This architectural design not only simplifies the execution process but also enhances the scalability and reliability of distributed data processing.
Source: Microsoft Documentation
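A hedged illustration of the DAG at work: the chain of transformations below only builds a plan, and explain() (the formatted mode assumes Spark 3.x) prints what Spark derived from that DAG. The Parquet path and column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# A chain of transformations; nothing runs yet, Spark only extends the logical plan (DAG).
# The path and column names are placeholders.
orders = spark.read.parquet("/data/orders")
summary = (
    orders
    .filter(F.col("status") == "COMPLETE")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
)

# explain() prints the plan Spark derived from the DAG, i.e. how it intends to
# execute the work once an action triggers it.
summary.explain(mode="formatted")
```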
Let's walk through the job execution flow of a Spark application:
App Submission: When a user submits an application, the driver program is started. The driver requests resources for the application from the cluster manager.
Job and DAG Creation: The driver splits the code into jobs, which are further divided into stages. Each stage is divided into smaller tasks.
Task Scheduling: The DAG scheduler organizes the jobs into stages, and the task scheduler assigns tasks to executors, considering resource availability and data locality.
Task Execution: Executors on worker nodes process the tasks, and the results are sent back to the driver, which aggregates and presents the final output to the user.
Throughout this flow, Spark applies several optimizations:
Lazy Evaluation: Delays processing until an action actually requires a result.
Data Locality: Processes data close to its location to minimize data transfer.
In-memory Computing: Stores intermediate results in memory to avoid disk I/O.
Speculative Execution: If a task is slow, Spark launches duplicate tasks on other nodes and uses the result from whichever finishes first.
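Two of these knobs can be set explicitly when building the session; a hedged sketch follows (spark.speculation and spark.locality.wait are standard Spark configuration keys, and the values shown are only illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("optimization-demo")
    .config("spark.speculation", "true")    # re-launch suspiciously slow tasks elsewhere
    .config("spark.locality.wait", "3s")    # how long to wait for a data-local slot
    .getOrCreate()
)

# Lazy evaluation in action: the filter below only extends the plan...
df = spark.range(0, 100).filter("id % 2 = 0")
# ...the work actually happens when an action such as count() is called.
print(df.count())
```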
This topic will be explored further with an Azure Databricks example in the latter part of the blog.
Now that we know how Apache Spark works, let's dive into our main topic: how Azure supports Apache Spark. There are a few ways this parallel processing framework is offered in Azure:
Apache Spark in Azure HDInsight
Spark pools in Azure Synapse Analytics
Apache Spark on Azure Databricks
Spark Activities in Azure Data Factory
This is a basic diagram of the Azure HDInsight architecture:
Source: Microsoft Documentation
Practical Applications of Apache Spark in Azure Databricks
Let's see how to set up an Apache Spark cluster in the Azure Databricks environment.
Policy: A tool for workspace admins to manage user or group permissions for compute creation based on predefined rules. Benefits include controlling cluster settings, limiting cluster counts, enabling user-specific clusters, managing costs, and enforcing cluster-scoped libraries.
Multi-Node vs Single Node: Selecting either option determines the compute setup for task execution.
Single Node: Creates one compute instance for both the worker and driver node.
Multi-Node: Creates one driver node and multiple worker nodes, configurable with a range of minimum and maximum workers. Autoscaling can be enabled so that the cluster scales between the specified minimum and maximum workers depending on the workload.
Additionally, you can enable spot instances, if available, to take advantage of unused Azure capacity at a lower cost, and you can define an inactivity period after which the compute automatically terminates.
Access Mode: There are three types of access modes: Single User, Shared, and No Isolation Shared.
Single User: The cluster runs workloads for a single user.
Shared: Multiple users can share the cluster to run workloads, secured by Unity Catalog (details about the Unity Catalog will be discussed in a separate article). This mode supports only SQL, Scala, and Python.
No Isolation Shared: Multiple users can share the cluster without isolation, allowing workloads in any language.
Databricks Runtime Version: Allows you to select a runtime image for creating a cluster. Specific versions of these images are outlined in the Databricks guide.
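Taken together, these settings map onto the cluster definition that Databricks accepts through its Clusters API. The sketch below expresses them as a Python dict mirroring that JSON payload; the field names follow the public API documentation, while the runtime version, node type, and policy ID are placeholders for your workspace.

```python
# Hedged sketch of a cluster definition, mirroring the Databricks Clusters API payload.
# All IDs, the runtime version, and the VM size are placeholders.
cluster_config = {
    "cluster_name": "spark-blog-demo",
    "spark_version": "13.3.x-scala2.12",        # Databricks Runtime version (placeholder)
    "node_type_id": "Standard_DS3_v2",          # Azure VM size for driver and workers
    "policy_id": "<your-policy-id>",            # optional: enforce admin-defined compute rules
    "autoscale": {                              # multi-node cluster with autoscaling
        "min_workers": 2,
        "max_workers": 8,
    },
    "autotermination_minutes": 30,              # terminate after 30 minutes of inactivity
    "data_security_mode": "SINGLE_USER",        # access mode: SINGLE_USER / USER_ISOLATION / NONE
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # prefer spot VMs, fall back to on-demand
        "first_on_demand": 1,                        # keep the driver on an on-demand VM
    },
}
```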
Above are the basics of creating an Apache Spark cluster on Azure Databricks. Now let's find out how the Spark architecture described earlier plays out in Azure Databricks.
The following is a simple example that loads a CSV file into a DataFrame, removes duplicates, and gets the number of records in the DataFrame. Using this example, let's see how the Spark architecture works here.
In the first step, loading data from a CSV file into a Spark DataFrame, the data is not actually loaded into memory because Spark uses lazy evaluation. Instead, it creates a Directed Acyclic Graph (DAG) logical plan (as shown in Image A1). The DAG contains all the information required to execute the plan.
Each individual step in the DAG is associated with an RDD (as shown in Image A2), which is the fundamental data structure in Apache Spark. This means that RDDs are how Spark organizes data in memory. However, at this stage, the data is not yet loaded into memory—only the logical plan has been created, as mentioned earlier.
Once an RDD is created (Image A2), it is immutable. After its creation, it cannot be modified; instead, any transformations will create a new RDD while the old one remains unchanged. This immutability ensures fault tolerance. For example, if an RDD is lost, Spark can recompute it using lineage information, which tracks how the RDD was created from another RDD.
Spark is lazy by nature, meaning it does not execute code immediately when it is run. Instead, it creates a logical plan based on a Directed Acyclic Graph (DAG). This DAG continues to grow as additional transformations are added to the plan. The execution of the DAG only happens when a Spark Action Command is encountered in the code.
Some common Spark action commands include collect(), count(), take(), reduce(), and first(). This deferred execution behaviour is referred to as lazy evaluation in Spark.
In the transformation code below, dropDuplicates does not trigger any Spark job; a job is only run when the count action is called.
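A hedged sketch of that pipeline (the schema, column names, and file path are placeholders; supplying the schema up front keeps the read itself free of any schema-inference job):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("dropduplicates-demo").getOrCreate()

# Placeholder schema and path; supplying the schema avoids a schema-inference pass.
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("email", StringType(), True),
])

# Transformations only: the read and dropDuplicates() just extend the DAG, no job runs yet.
df = spark.read.schema(schema).csv("/data/customers.csv")
deduped = df.dropDuplicates()

# Action: count() triggers execution -- read the partitions, drop duplicates, count the rows.
print(deduped.count())
```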
Let's look at this in more detail.
When the code is submitted to the driver program, the DAG logical plan is divided into multiple stages, as shown in Image B1. These stages are further broken down into multiple tasks, with each task corresponding to a partition of the data. For example, if the CSV file is split into two partitions, two tasks will be created for each stage.
Once the DAG is divided into tasks, the driver program splits and assigns these tasks to the executors on the worker nodes (Image B2). The executors on the worker nodes are responsible for executing the tasks assigned to them by the driver. The driver program monitors the execution process and collects the results from the worker nodes.
To perform this, the driver program submits the DAG logic to the cluster manager, requesting resources for the worker nodes. Upon receiving the DAG from the driver, the cluster manager allocates resources to the worker nodes and schedules tasks based on the availability of executors on the worker nodes. Additionally, the cluster manager ensures that tasks are distributed efficiently to optimize performance and speed up the process.
Once the worker nodes receive the required resources and tasks, each executor on the worker nodes processes its assigned partition of the CSV file. The first step involves reading the CSV file from the stage into local memory. Next, the executors apply the dropDuplicates transformation on the data. Finally, the executors count the rows in their respective partitions and send this count back to the driver.
All the worker nodes execute their tasks in parallel, significantly speeding up the process. Finally, the driver aggregates the results from the worker nodes to calculate the total row count.
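To see the partition-to-task relationship for yourself, here is a small, hedged example (spark.range is used only so the partition count is easy to control):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Each partition becomes one task per stage, so 2 partitions -> 2 tasks running in parallel.
df = spark.range(0, 1_000_000, numPartitions=2)
print(df.rdd.getNumPartitions())   # 2

# Repartitioning spreads the data across more partitions, and therefore more parallel tasks.
print(df.repartition(8).rdd.getNumPartitions())   # 8
```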
Conclusion
Apache Spark on Azure represents a powerful combination for big data analytics, offering scalability, speed, and flexibility to process vast datasets efficiently. Its distributed architecture enables real-time processing, while Azure services like Azure Databricks enhance its usability by providing a seamless environment for collaboration and integration.
By leveraging Azure Databricks, businesses can unlock actionable insights and streamline their data workflows, driving innovation and competitive advantage. Whether you're analyzing streaming data, building machine learning models, or processing batch workloads, Apache Spark on Azure empowers data professionals to deliver results at scale.
As the demand for big data solutions continues to grow, adopting a Spark-based architecture in Azure ensures you're equipped to tackle modern data challenges with confidence.