Overview of Azure Data Factory Components

📖 Overview

Organizations often deal with massive volumes of data scattered across various storage systems, relational, non-relational, and file-based alike, and much of it is unstructured. A key challenge for these organizations is transforming this raw data into well-organized, meaningful insights that can drive business decisions.

Azure Data Factory, a fully managed cloud service, offers a robust solution for transforming disorganized data into actionable insights. It simplifies the management of complex hybrid workflows, including extract-transform-load (ETL), extract-load-transform (ELT), and data integration tasks.

Azure Data Factory (ADF) is a powerful cloud-based data integration service that allows organizations to orchestrate and automate data workflows across various sources and destinations. Understanding its core components is essential to leveraging its full potential. Here’s an overview of the key components of Azure Data Factory:

1. Pipelines

Pipelines are logical groupings of activities that together perform a specific task or process, such as moving or transforming data. The pipeline authoring canvas is organized as follows (a minimal SDK sketch of a pipeline definition follows this list):

Activity Pane (Left):

  • Displays a list of activities grouped by category, such as Move and Transform or Control.

Design Canvas (Right):

  • A workspace to visually build workflows by adding and connecting activities.

Configuration Panes (Bottom):

  • Parameters: Define dynamic inputs.

  • Variables: Store and manipulate values during runtime.

  • Settings: Customize activity configurations.

  • Output: View activity execution results.
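
The same pipeline you assemble on the design canvas can also be defined programmatically. Below is a minimal sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, and dataset names are placeholders, and exact model signatures can vary slightly between SDK versions.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
    )

    # Placeholder names -- substitute your own subscription, resource group, and factory.
    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-rg", "my-adf"

    # One activity: copy a blob from an input dataset to an output dataset
    # (both datasets are assumed to already exist in the factory).
    copy_step = CopyActivity(
        name="CopyRawToStaging",
        inputs=[DatasetReference(type="DatasetReference", reference_name="RawCsv")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="StagingCsv")],
        source=BlobSource(),
        sink=BlobSink(),
    )

    # A pipeline is simply the logical grouping of such activities.
    pipeline = PipelineResource(activities=[copy_step])
    adf.pipelines.create_or_update(rg, factory, "CopyRawDataPipeline", pipeline)

    # Kick off a one-off run (the same pipeline can also be started by a trigger).
    run = adf.pipelines.create_run(rg, factory, "CopyRawDataPipeline", parameters={})
    print(run.run_id)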

2. Activities

An activity represents a single processing step within a pipeline. Azure Data Factory supports three types of activities (a brief control-flow sketch follows this list):

  • Data Movement Activities: Copy data from a source data store to a destination (sink).

  • Data Transformation Activities: Transform and clean the data using tools like Data Flows or external compute services.

  • Control Activities: Control the pipeline flow (e.g., conditions, loops, or executing other pipelines).
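
As an illustration of the control category, here is a hedged sketch of a ForEach activity that loops over a pipeline parameter and runs a simple Wait step for each item. The parameter name and wait duration are invented for the example, and the model names come from the azure-mgmt-datafactory package.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, ParameterSpecification, ForEachActivity, Expression, WaitActivity,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-rg", "my-adf"  # placeholders, as in the earlier sketch

    # Loop over an array parameter supplied at run time; the inner Wait activity
    # is a placeholder standing in for real per-item work.
    loop = ForEachActivity(
        name="ForEachFile",
        items=Expression(value="@pipeline().parameters.fileNames"),
        activities=[WaitActivity(name="SimulateWork", wait_time_in_seconds=5)],
    )

    pipeline = PipelineResource(
        parameters={"fileNames": ParameterSpecification(type="Array")},
        activities=[loop],
    )
    adf.pipelines.create_or_update(rg, factory, "ControlFlowDemoPipeline", pipeline)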

3. Data Flows

Data Flows are a specialized feature in Azure Data Factory for designing data transformations that execute on the Apache Spark framework. They let you build multi-step transformations through a visual interface, requiring minimal code beyond basic data expressions.

The Mapping Data Flow interface in Azure Data Factory enables visual data transformations. Key components include:

Visual Editor (Top):

  • Source Node (Athletes): Represents the input dataset with metadata details.

  • Transformation Node (athletesfilter): Applies filters or modifies data using expressions (e.g., filtering by "Discipline").

  • Sink Node (sink1): Defines the output dataset for transformed data.

Available Transformations:

  • Includes options like Join, Aggregate, Pivot, and Derived Column for flexible data manipulation.

Configuration Panel (Bottom):

  • Configure source settings, transformations, schema drift, and output details.

=> This intuitive interface simplifies designing complex data workflows without requiring extensive coding. Within a pipeline, a Mapping Data Flow runs through an Execute Data Flow activity, as sketched below.
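
The sketch below assumes a Mapping Data Flow named athletes_cleanup already exists in the factory (matching the Athletes example above) and shows how it might be wired into a pipeline with the azure-mgmt-datafactory SDK; names are placeholders and signatures may differ slightly across SDK versions.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, ExecuteDataFlowActivity, DataFlowReference,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-rg", "my-adf"  # placeholders

    # Reference a Mapping Data Flow designed in the visual editor; at run time
    # ADF provisions a managed Spark cluster to execute it.
    run_flow = ExecuteDataFlowActivity(
        name="RunAthletesCleanup",
        data_flow=DataFlowReference(type="DataFlowReference", reference_name="athletes_cleanup"),
    )

    pipeline = PipelineResource(activities=[run_flow])
    adf.pipelines.create_or_update(rg, factory, "AthletesDataFlowPipeline", pipeline)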

4. Datasets

Datasets define the structure, format, and location of the data used in pipelines, whether as input or output. A dataset represents a specific data entity such as a file, folder, or table, and is configured with details like:

  • Connection: Specifies a linked service to establish a connection to the data store.

  • Path: Identifies the file or folder in the storage system.

  • Schema: Defines the structure of the data (e.g., column names and types).

  • Settings: Includes delimiters, compression type, encoding, and header row configuration.

The dataset connects to the data store through a linked service, which provides the connection details such as credentials and endpoint. Additionally, the Preview Data panel allows validation of the dataset configuration by displaying sample records.
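
For example, a delimited-text dataset pointing at a CSV file in Blob Storage might be registered as sketched below. The container, folder, file, and linked-service names are assumptions, and DelimitedTextDataset is the SDK counterpart of the CSV dataset type shown in the UI.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        DatasetResource, DelimitedTextDataset, LinkedServiceReference, AzureBlobStorageLocation,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-rg", "my-adf"  # placeholders

    # Structure, format, and location of a CSV file, connected through a linked service.
    csv_dataset = DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="BlobStorageLS"),
        location=AzureBlobStorageLocation(
            container="raw", folder_path="olympics", file_name="athletes.csv"),
        column_delimiter=",",
        first_row_as_header=True,
    )

    adf.datasets.create_or_update(rg, factory, "RawCsv", DatasetResource(properties=csv_dataset))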

5. Linked Services

Linked services configure connections to external data sources (e.g., HTTP endpoints, databases, storage systems) and compute resources, acting as a bridge that integrates ADF with other services.
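
A hedged sketch of registering an Azure Storage linked service with the Python SDK follows. The connection string is a placeholder; in practice, a Key Vault reference or managed identity is preferable to an inline account key.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, AzureStorageLinkedService, SecureString,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-rg", "my-adf"  # placeholders

    # Connection details for the data store; datasets reference this linked service by name.
    conn = SecureString(value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")
    blob_ls = LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=conn))

    adf.linked_services.create_or_update(rg, factory, "BlobStorageLS", blob_ls)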

6. Triggers

Create and manage triggers to automate pipeline execution. Options include (a schedule-trigger sketch follows this list):

  • Schedule Trigger: Executes pipelines at specified times or intervals.

  • Storage Event Trigger: Automatically triggers pipelines based on blob creation or deletion events in Azure Storage.

  • Tumbling Window Trigger: Executes pipelines over fixed-size, non-overlapping, contiguous time intervals (e.g., hourly windows), which suits batch processing.

  • Custom Event Trigger: Triggers pipelines based on custom events published to Azure Event Grid.
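
As a sketch, a schedule trigger that starts a pipeline every hour could be registered as below. Names and times are placeholders, and in some SDK versions the activation call is triggers.start rather than triggers.begin_start.

    from datetime import datetime, timedelta
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
        TriggerPipelineReference, PipelineReference,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-rg", "my-adf"  # placeholders

    # Run CopyRawDataPipeline once an hour for the next week.
    recurrence = ScheduleTriggerRecurrence(
        frequency="Hour",
        interval=1,
        start_time=datetime.utcnow(),
        end_time=datetime.utcnow() + timedelta(days=7),
        time_zone="UTC",
    )
    trigger = ScheduleTrigger(
        description="Hourly run",
        recurrence=recurrence,
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name="CopyRawDataPipeline"),
            parameters={},
        )],
    )

    adf.triggers.create_or_update(rg, factory, "HourlyTrigger", TriggerResource(properties=trigger))
    adf.triggers.begin_start(rg, factory, "HourlyTrigger").result()  # activate the trigger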

7. Integration Runtimes

Integration runtimes provide the compute infrastructure for executing activities such as data movement, transformation, and integration across cloud, on-premises, and hybrid environments. There are three types (a self-hosted IR sketch follows this list):

  • Azure Integration Runtime (including the default AutoResolveIntegrationRuntime): Handles cloud-to-cloud data movement and transformations.

  • Self-Hosted Integration Runtime: Connects to on-premises or private network data sources securely.

  • Azure-SSIS Integration Runtime: Runs SQL Server Integration Services (SSIS) packages in the cloud.
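
The Azure IR is available by default, but a self-hosted IR must first be registered in the factory and then linked to a machine inside your network. A hedged SDK sketch follows; the runtime name is a placeholder and the listed operations may differ slightly across SDK versions.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-rg", "my-adf"  # placeholders

    # Register a self-hosted IR; the returned authentication key is then used when
    # installing the integration runtime agent on the on-premises machine.
    ir = IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="IR for on-premises data sources"))
    adf.integration_runtimes.create_or_update(rg, factory, "OnPremIR", ir)

    keys = adf.integration_runtimes.list_auth_keys(rg, factory, "OnPremIR")
    print(keys.auth_key1)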

🔑 Conclusion

In this discussion, we explored the Author and Manage pages in detail, highlighting the key components of Azure Data Factory and how they interconnect. Here's a comprehensive view of these components working together:

1. Pipelines: Orchestrate and execute one or more activities to manage data workflows.

2. Activities: Perform tasks such as moving data (e.g., Copy Activity), transforming data (e.g., Mapping Data Flow), or orchestrating control flows (e.g., ForEach, If Condition).

3. Datasets: Represent the structure and format of input and output data, such as files, folders, or database tables. They specify details such as schema, delimiters, and encoding, and they connect to their respective data stores through linked services.

4. Linked Services: Define the connection to external data sources or services, including Azure Blob Storage, Azure SQL Database, on-premises systems, or third-party platforms.

5. Triggers: Automate pipeline execution at specific times (e.g., schedule triggers), over fixed intervals (tumbling window triggers), or in response to events (e.g., storage event and custom event triggers).

6. Data Flows: Facilitate scalable, code-free data transformations by leveraging a visual interface. Data Flows operate on a managed Apache Spark cluster and support:

  • Mapping Data Flow: Perform transformations such as joins, aggregations, filtering, sorting, and derived columns.

  • Wrangling Data Flow: Use Power Query for interactive data preparation and transformations.

  • Built-in Transformations: Include filter, join, union, pivot, aggregate, surrogate key generation, and schema manipulation.

  • Source and Sink: Define where data comes from and where it goes after transformation.

  • Parameterization: Enable reusable and dynamic configurations for flexible data processing workflows.

7. Integration Runtimes (IR): Provide the compute and connectivity environment to perform activities. They include:

  • Azure IR: For cloud-to-cloud data movement and transformations.

  • Self-hosted IR: For accessing on-premises or private network data securely.

  • Azure-SSIS IR: For running SQL Server Integration Services (SSIS) packages in Azure.

This interconnected framework ensures seamless integration, movement, transformation, and processing of data within Azure Data Factory.

Microsoft's documentation provides a visual guide with a detailed overview of the complete Data Factory architecture.

