Apache Spark on Azure
Introduction to Apache Spark and Its Capabilities in Azure
Apache Spark on the Microsoft Azure platform enables users to harness Spark's powerful distributed computing features for big data processing and analytics. Before exploring the specific Azure services that support Apache Spark, let's understand what Apache Spark is and what its core capabilities are.
What is Apache Spark?
Apache Spark is an open-source, distributed computing engine designed for processing large-scale data sets quickly. It is widely used in big data and analytics applications due to its ability to handle diverse workloads efficiently, including:
Batch processing: Fast computation for large-scale datasets (Spark SQL).
Streaming analytics: Real-time data processing (Spark Streaming).
Graph processing: Complex graph computations for relationship analysis (Spark GraphX).
Machine learning (ML) and AI: Building scalable, data-driven models (Spark MLlib).
Spark’s architecture is optimized for speed, scalability, and versatility, making it an essential tool for modern big data and AI applications.
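As a quick taste of what working with Spark looks like, here is a minimal PySpark sketch of the batch/SQL side of these capabilities; the file path and column names are placeholders rather than real data:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for Spark SQL and DataFrames.
spark = SparkSession.builder.appName("spark-capabilities-demo").getOrCreate()

# Batch processing with Spark SQL: read a file and run a declarative query over it.
# The path and column names below are placeholders for your own data.
sales = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
sales.createOrReplaceTempView("sales")

top_products = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
""")
top_products.show()
```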
Apache Spark is often compared to Apache Hadoop, especially its MapReduce component. The key difference between the two lies in their data processing approach.
Hadoop MapReduce writes intermediate data to disk between each processing step, which can significantly slow down performance, especially for iterative tasks.
Apache Spark, on the other hand, processes data in-memory whenever possible, reducing the need for disk I/O. This approach greatly enhances processing speed and efficiency, particularly for iterative computations such as those used in machine learning algorithms or graph processing.
This in-memory computation is one of Spark's standout features, making it a preferred choice for real-time and interactive analytics over traditional batch processing frameworks like Hadoop MapReduce.
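To make the contrast concrete, here is a small, hedged PySpark sketch of the in-memory pattern: the dataset is generated on the fly, and cache() is what keeps it in memory between iterative passes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# A synthetic dataset; in a real workload this would be read from storage.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "value")

# cache() asks Spark to keep the data in memory after the first action,
# so repeated passes (typical of iterative ML algorithms) skip the disk read.
df.cache()

for i in range(3):
    # Each pass after the first reuses the cached, in-memory partitions.
    total = df.selectExpr("sum(value) AS s").first()["s"]
    print(f"iteration {i}: sum = {total}")

df.unpersist()
```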
How Apache Spark Works
Apache Spark has a hierarchical primary/secondary (master/worker) architecture. The Spark Driver is the primary node: it requests resources from the cluster manager, which manages the secondary (worker) nodes, and it returns the results to the application client.
This architecture depends on two main abstractions:
1. Resilient Distributed Datasets (RDD)
2. Directed Acyclic Graphs (DAG)
Resilient Distributed Datasets (RDDs)
A Resilient Distributed Dataset (RDD) is Spark's fundamental data abstraction: an immutable, partitioned collection of records that can be processed in parallel across a cluster. RDDs achieve fault tolerance through lineage information. Spark keeps track of the series of transformations that led to the creation of each RDD, making it possible to recompute lost partitions if necessary. This mechanism ensures data integrity even in the event of hardware failures or cluster disruptions. Additionally, RDDs are inherently distributed, with their partitions spread across multiple nodes in a cluster, enabling high performance and scalability. There are two main ways to create RDDs. The first is by loading data from external sources such as Hadoop Distributed File System (HDFS), Azure Data Lake Storage Gen2, or Amazon S3; the second is by parallelizing an existing collection in the driver program.
By abstracting complexities like data distribution, fault recovery, and parallelism, RDDs enable developers to focus on implementing transformations and actions to derive insights from data. As the backbone of Spark’s distributed computing capabilities, RDDs play a pivotal role in ensuring reliable, efficient, and scalable data processing.
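A short, hedged sketch of both creation paths and of lineage in practice (the text-file path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Way 1: create an RDD by loading data from external storage.
# The path is a placeholder; it could point to HDFS, ADLS Gen2 (abfss://...), or S3.
lines = sc.textFile("/data/events.txt")

# Way 2: create an RDD by parallelizing a collection that already exists in the driver.
numbers = sc.parallelize(range(10), numSlices=4)

# Transformations return new, immutable RDDs; the originals are left unchanged.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString() shows the lineage Spark would use to recompute lost partitions.
print(evens.toDebugString().decode("utf-8"))
print(evens.collect())
```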
Directed Acyclic Graph (DAG)
Apache Spark's architecture relies on a Directed Acyclic Graph (DAG) to represent the sequence of transformations applied to Resilient Distributed Datasets (RDDs). When a user submits a Spark application, the system constructs a DAG that outlines the various stages and tasks required for execution. This DAG is managed by the DAG Scheduler, which divides the graph into stages that can be executed in parallel wherever possible.
By organizing transformations into a DAG, Spark can optimize the execution plan, ensuring efficient resource utilization and faster processing. The DAG provides a clear overview of dependencies between operations, enabling Spark to streamline task execution, minimize data shuffling, and achieve better performance. This architectural design not only simplifies the execution process but also enhances the scalability and reliability of distributed data processing.
Source: Microsoft Documentation
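A hedged illustration of the DAG at work: the chain of transformations below only builds a plan, and explain() (the formatted mode assumes Spark 3.x) prints what Spark derived from that DAG. The Parquet path and column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# A chain of transformations; nothing runs yet, Spark only extends the logical plan (DAG).
# The path and column names are placeholders.
orders = spark.read.parquet("/data/orders")
summary = (
    orders
    .filter(F.col("status") == "COMPLETE")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
)

# explain() prints the plan Spark derived from the DAG, i.e. how it intends to
# execute the work once an action triggers it.
summary.explain(mode="formatted")
```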
Let's walk through the job execution flow of a Spark application:
App Submission: When a user submits an application, the driver program is started. The driver requests resources for the application from the cluster manager.
Job and DAG Creation: The driver splits the code into jobs, which are further divided into stages. Each stage is divided into smaller tasks.
Task Scheduling: The DAG scheduler organizes the jobs into stages, and the task scheduler assigns tasks to executors, considering resource availability and data locality.
Task Execution: Executors on worker nodes process the tasks, and the results are sent back to the driver, which aggregates and presents the final output to the user.
Throughout this flow, Spark applies several optimizations:
Lazy Evaluation: Delays processing until an action actually requires a result.
Data Locality: Processes data close to its location to minimize data transfer.
In-memory Computing: Stores intermediate results in memory to avoid disk I/O.
Speculative Execution: If a task is slow, Spark launches duplicate tasks on other nodes and uses the result from whichever finishes first.
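Two of these knobs can be set explicitly when building the session; a hedged sketch follows (spark.speculation and spark.locality.wait are standard Spark configuration keys, and the values shown are only illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("optimization-demo")
    .config("spark.speculation", "true")    # re-launch suspiciously slow tasks elsewhere
    .config("spark.locality.wait", "3s")    # how long to wait for a data-local slot
    .getOrCreate()
)

# Lazy evaluation in action: the filter below only extends the plan...
df = spark.range(0, 100).filter("id % 2 = 0")
# ...the work actually happens when an action such as count() is called.
print(df.count())
```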
This topic will be explored further with an Azure Databricks example in the latter part of the blog.
Now that we know how Apache Spark works, let's dive into our main topic: how Azure supports Apache Spark. There are a few ways this parallel processing framework is offered in Azure:
Apache Spark in Azure HDInsight
Spark pools in Azure Synapse Analytics
Apache Spark on Azure Databricks
Spark Activities in Azure Data Factory
This is a basic diagram of the Azure HDInsight architecture:
Source: Microsoft Documentation
Practical Applications of Apache Spark in Azure Databricks
Let's see how to set up an Apache Spark cluster in the Azure Databricks environment.
Policy: A tool for workspace admins to manage user or group permissions for compute creation based on predefined rules. Benefits include controlling cluster settings, limiting cluster counts, enabling user-specific clusters, managing costs, and enforcing cluster-scoped libraries.
Multi-Node vs Single Node: Selecting either option determines the compute setup for task execution.
Single Node: Creates one compute instance for both the worker and driver node.
Multi-Node: Creates one driver node and multiple worker nodes, configurable with a range of minimum and maximum workers. Autoscaling can be enabled so that the cluster scales between the specified minimum and maximum workers depending on the workload.
Additionally, you can enable spot instances, if available, to take advantage of unused Azure capacity at a lower cost, and you can define an inactivity period after which the compute automatically terminates.
Access Mode: There are three types of access modes: Single User, Shared, and No Isolation Shared.
Single User: The cluster runs workloads for a single user.
Shared: Multiple users can share the cluster to run workloads, secured by Unity Catalog (details about the Unity Catalog will be discussed in a separate article). This mode supports only SQL, Scala, and Python.
No Isolation Shared: Multiple users can share the cluster without isolation, allowing workloads in any language.
Databricks Runtime Version: Allows you to select a runtime image for creating a cluster. Specific versions of these images are outlined in the Databricks guide.
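Taken together, these settings map onto the cluster definition that Databricks accepts through its Clusters API. The sketch below expresses them as a Python dict mirroring that JSON payload; the field names follow the public API documentation, while the runtime version, node type, and policy ID are placeholders for your workspace.

```python
# Hedged sketch of a cluster definition, mirroring the Databricks Clusters API payload.
# All IDs, the runtime version, and the VM size are placeholders.
cluster_config = {
    "cluster_name": "spark-blog-demo",
    "spark_version": "13.3.x-scala2.12",        # Databricks Runtime version (placeholder)
    "node_type_id": "Standard_DS3_v2",          # Azure VM size for driver and workers
    "policy_id": "<your-policy-id>",            # optional: enforce admin-defined compute rules
    "autoscale": {                              # multi-node cluster with autoscaling
        "min_workers": 2,
        "max_workers": 8,
    },
    "autotermination_minutes": 30,              # terminate after 30 minutes of inactivity
    "data_security_mode": "SINGLE_USER",        # access mode: SINGLE_USER / USER_ISOLATION / NONE
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # prefer spot VMs, fall back to on-demand
        "first_on_demand": 1,                        # keep the driver on an on-demand VM
    },
}
```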
Above are the basics of creating an Apache Spark cluster on Azure Databricks. Now let's find out how the Spark architecture described earlier plays out in Azure Databricks.
The following is a simple example that loads a CSV file into a DataFrame, removes duplicates, and gets the number of records in the DataFrame. Using this example, let's see how the Spark architecture works here.
In the first step, loading data from a CSV file into a Spark DataFrame, the data is not actually loaded into memory because Spark uses lazy evaluation. Instead, it creates a Directed Acyclic Graph (DAG) logical plan (as shown in Image A1). The DAG contains all the information required to execute the plan.
Each individual step in the DAG is associated with an RDD (as shown in Image A2), which is the fundamental data structure in Apache Spark. This means that RDDs are how Spark organizes data in memory. However, at this stage, the data is not yet loaded into memory—only the logical plan has been created, as mentioned earlier.
Once an RDD is created (Image A2), it is immutable. After its creation, it cannot be modified; instead, any transformations will create a new RDD while the old one remains unchanged. This immutability ensures fault tolerance. For example, if an RDD is lost, Spark can recompute it using lineage information, which tracks how the RDD was created from another RDD.
Spark is lazy by nature, meaning it does not execute code immediately when it is run. Instead, it creates a logical plan based on a Directed Acyclic Graph (DAG). This DAG continues to grow as additional transformations are added to the plan. The execution of the DAG only happens when a Spark Action Command is encountered in the code.
Some common Spark action commands include collect(), count(), take(), reduce(), and first(). This deferred execution behaviour is referred to as lazy evaluation in Spark.
In the transformation code below, dropDuplicates does not trigger any Spark job; a job is only run when the count action is called.
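A hedged sketch of that pipeline (the schema, column names, and file path are placeholders; supplying the schema up front keeps the read itself free of any schema-inference job):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("dropduplicates-demo").getOrCreate()

# Placeholder schema and path; supplying the schema avoids a schema-inference pass.
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("email", StringType(), True),
])

# Transformations only: the read and dropDuplicates() just extend the DAG, no job runs yet.
df = spark.read.schema(schema).csv("/data/customers.csv")
deduped = df.dropDuplicates()

# Action: count() triggers execution -- read the partitions, drop duplicates, count the rows.
print(deduped.count())
```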
Let's look at this in more detail.
When the code is submitted to the driver program, the DAG logical plan is divided into multiple stages, as shown in Image B1. These stages are further broken down into multiple tasks, with each task corresponding to a partition of the data. For example, if the CSV file is split into two partitions, two tasks will be created for each stage.
Once the DAG is divided into tasks, the driver program splits and assigns these tasks to the executors on the worker nodes (Image B2). The executors on the worker nodes are responsible for executing the tasks assigned to them by the driver. The driver program monitors the execution process and collects the results from the worker nodes.
To perform this, the driver program submits the DAG logic to the cluster manager, requesting resources for the worker nodes. Upon receiving the DAG from the driver, the cluster manager allocates resources to the worker nodes and schedules tasks based on the availability of executors on the worker nodes. Additionally, the cluster manager ensures that tasks are distributed efficiently to optimize performance and speed up the process.
Once the worker nodes receive the required resources and tasks, each executor on the worker nodes processes its assigned partition of the CSV file. The first step involves reading the CSV file from the stage into local memory. Next, the executors apply the dropDuplicates transformation on the data. Finally, the executors count the rows in their respective partitions and send this count back to the driver.
All the worker nodes execute their tasks in parallel, significantly speeding up the process. Finally, the driver aggregates the results from the worker nodes to calculate the total row count.
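To see the partition-to-task relationship for yourself, here is a small, hedged example (spark.range is used only so the partition count is easy to control):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Each partition becomes one task per stage, so 2 partitions -> 2 tasks running in parallel.
df = spark.range(0, 1_000_000, numPartitions=2)
print(df.rdd.getNumPartitions())   # 2

# Repartitioning spreads the data across more partitions, and therefore more parallel tasks.
print(df.repartition(8).rdd.getNumPartitions())   # 8
```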
Conclusion
Apache Spark on Azure represents a powerful combination for big data analytics, offering scalability, speed, and flexibility to process vast datasets efficiently. Its distributed architecture enables real-time processing, while Azure services like Azure Databricks enhance its usability by providing a seamless environment for collaboration and integration.
By leveraging Azure Databricks, businesses can unlock actionable insights and streamline their data workflows, driving innovation and competitive advantage. Whether you're analyzing streaming data, building machine learning models, or processing batch workloads, Apache Spark on Azure empowers data professionals to deliver results at scale.
As the demand for big data solutions continues to grow, adopting a Spark-based architecture in Azure ensures you're equipped to tackle modern data challenges with confidence.