Azure Data Factory — Aamir P

Hello readers!

I’m doing Azure Data Factory as part of my learning journey. Let us learn together in this article.

Feel free to correct me if anything is incorrect.

Azure Data Factory (ADF) is a cloud-based data integration service that lets you create, schedule, and orchestrate data workflows, moving and transforming data from diverse sources to destinations.

Why use ADF?

  1. It supports both ETL and ELT pipelines.

  2. Connects to a wide variety of data sources like SQL Server, Azure Blob Storage, REST APIs, and many more.

  3. You can build pipelines visually or integrate code using Data Flows.

  4. No infrastructure to maintain; Microsoft handles scaling and uptime.

  5. Allows you to coordinate data workflows — running multiple activities in sequence or parallel.

Architecture

Source → Linked Service → Dataset → Activity → Pipeline → Sink (Destination)

  1. Source/Sink: Where your data is coming from (source) and going to (sink).

  2. Linked Service: A linked service holds the connection information (for example, a connection string) and authentication details ADF needs to reach a data store or compute.

  3. Dataset: A dataset represents the data structure and location inside a data store. For example, a specific CSV file in Blob Storage or a table in SQL Server.

  4. Activity: Activities are individual tasks within a pipeline, such as Copy Activity, Data Flow Activity, or Web Activity.

  5. Pipeline: A pipeline is a logical grouping of activities that together perform a data task, for example, copying data from a Blob storage to an Azure SQL database.

  6. Trigger: Triggers automate pipeline execution based on a schedule, event, or tumbling window.

  7. Integration Runtime: The compute infrastructure that carries out data movement and transformation. There are three types:

     a. Azure IR: for cloud-based data movement.

     b. Self-hosted IR: for accessing on-premises or private-network data.

     c. Azure-SSIS IR: for running SSIS packages.

Features

  1. For complex transformations like joins, aggregations, or filtering, ADF offers Mapping Data Flows. This visual interface lets you create scalable Spark transformations without coding.

  2. You can make your pipelines dynamic by using parameters: values passed at runtime to datasets, linked services, or pipeline activities. This helps reuse pipelines for different files or databases (a small sketch follows this list).

  3. ADF has a monitoring dashboard to track pipeline and activity runs. You can set up alerts to get notified on failures or delays.

  4. Integrate ADF with Git (Azure DevOps or GitHub) to manage your pipeline code versions, collaborate with teams, and implement Continuous Integration/Continuous Deployment pipelines.
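
To make the parameter idea concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. The subscription ID, resource group (demo-rg), factory name (demo-adf), pipeline name, and parameter name are placeholders of my own, and exact model constructors can differ slightly between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, ParameterSpecification

# Assumed placeholders for this walkthrough
rg_name, df_name = "demo-rg", "demo-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Declare a pipeline-level parameter; datasets and activities can read it
# at runtime with the expression @pipeline().parameters.fileName
pipeline = PipelineResource(
    activities=[],  # add your Copy/Web activities here (see Step 4 below)
    parameters={"fileName": ParameterSpecification(type="String")})
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyCsvPipeline", pipeline)

# Pass a concrete value at run time, so one pipeline serves many files
adf_client.pipelines.create_run(
    rg_name, df_name, "CopyCsvPipeline",
    parameters={"fileName": "sales_2024_06.csv"})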

Let us build a pipeline for some practical exposure:

Step 1: Create an Azure Data Factory Instance

  • Go to the Azure Portal.

  • Search for Data Factory and create a new instance.

  • Choose a resource group and region.

  • Wait for deployment.
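
If you prefer scripting this step instead of clicking through the portal, the azure-mgmt-datafactory Python SDK can create the instance as well. A minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages are installed; the subscription ID, resource group, factory name, and region are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

sub_id = "<subscription-id>"   # placeholder
rg_name = "demo-rg"            # existing resource group (placeholder)
df_name = "demo-adf"           # factory names must be globally unique

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# Create (or update) the Data Factory instance in the chosen region
df = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))
print(df.name, df.provisioning_state)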

Step 2: Create Linked Services

  • Open your Data Factory studio.

  • Under Manage, create linked services to your data sources/destinations.

  • For example, create a linked service for Azure Blob Storage with your storage account credentials.

  • Create another linked service for your SQL database.
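
The same two linked services can also be defined from code. A minimal sketch with the Python SDK; the connection strings and the names BlobStorageLS and AzureSqlLS are placeholders chosen for this walkthrough.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService,
    AzureSqlDatabaseLinkedService, SecureString)

rg_name, df_name = "demo-rg", "demo-adf"   # from Step 1
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Blob Storage linked service (connection string wrapped as a SecureString)
blob_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))
adf_client.linked_services.create_or_update(rg_name, df_name, "BlobStorageLS", blob_ls)

# Azure SQL Database linked service
sql_ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
    connection_string=SecureString(
        value="Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;Password=<pwd>")))
adf_client.linked_services.create_or_update(rg_name, df_name, "AzureSqlLS", sql_ls)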

Step 3: Create Datasets

  • Define datasets that point to the actual data you want to move or process.

  • E.g., a dataset for CSV files in Blob Storage.

  • A dataset for a table in Azure SQL Database.
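
In code, a dataset wraps a linked-service reference plus the location and format of the data. A minimal sketch that defines the CSV and SQL-table datasets against the linked services from Step 2; folder, file, table, and dataset names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, AzureSqlTableDataset,
    LinkedServiceReference, TextFormat)

rg_name, df_name = "demo-rg", "demo-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# CSV file sitting in Blob Storage
blob_ds = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="BlobStorageLS"),
    folder_path="input",
    file_name="sales.csv",
    format=TextFormat(column_delimiter=",", first_row_as_header=True)))
adf_client.datasets.create_or_update(rg_name, df_name, "InputCsvDS", blob_ds)

# Target table in Azure SQL Database
sql_ds = DatasetResource(properties=AzureSqlTableDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureSqlLS"),
    table_name="dbo.Sales"))
adf_client.datasets.create_or_update(rg_name, df_name, "SalesTableDS", sql_ds)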

Step 4: Build a Pipeline with Activities

  • Go to the Author tab.

  • Create a new pipeline.

  • Drag a Copy Activity onto the canvas.

  • Configure the source dataset (e.g., Blob Storage CSV).

  • Configure the sink dataset (e.g., Azure SQL Table).
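
A minimal sketch of the same Copy Activity defined with the Python SDK, wired to the two datasets from Step 3; activity and pipeline names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, AzureSqlSink)

rg_name, df_name = "demo-rg", "demo-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Copy the Blob CSV (source) into the Azure SQL table (sink)
copy = CopyActivity(
    name="CopyCsvToSql",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputCsvDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SalesTableDS")],
    source=BlobSource(),
    sink=AzureSqlSink())

pipeline = PipelineResource(activities=[copy])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyCsvPipeline", pipeline)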

Step 5: Validate and Debug

  • Validate your pipeline for errors.

  • Run the pipeline in debug mode to test.
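
Debug runs are a Studio feature, but from code you can start an on-demand run and poll it until it finishes, which works well for testing. A minimal sketch, reusing the placeholder names from the earlier steps.

import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

rg_name, df_name = "demo-rg", "demo-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off a run and wait for a terminal status
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyCsvPipeline", parameters={})
status = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
while status.status in ("Queued", "InProgress"):
    time.sleep(15)
    status = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print(status.status, status.message)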

Step 6: Trigger Pipeline

  • Add a trigger to schedule when this pipeline should run (e.g., daily at midnight).

  • Publish all changes to save them.
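
A minimal sketch of an equivalent daily schedule defined in code. The trigger name and start time are placeholders, the trigger does nothing until it is started, and newer SDK versions expose begin_start where older ones use start.

from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference)

rg_name, df_name = "demo-rg", "demo-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Run once per day, starting at midnight UTC
recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime(2025, 1, 1, 0, 0, tzinfo=timezone.utc),
    time_zone="UTC")

trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="CopyCsvPipeline"),
        parameters={})]))

adf_client.triggers.create_or_update(rg_name, df_name, "DailyMidnightTrigger", trigger)
adf_client.triggers.begin_start(rg_name, df_name, "DailyMidnightTrigger").result()  # 'start' in older SDKs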

Use Cases

Typically, ADF is used for the following purposes:

  1. Data Migration: for example, moving existing SSIS packages and their data workloads into ADF.

  2. Data Warehousing: Feeding data warehouses with batch or streaming data.

  3. ETL/ELT Pipelines: Extracting data from multiple sources, transforming it, and loading it into analytical stores.

  4. Real-time Analytics: Triggering pipelines based on events to support near-real-time data processing.

  5. Data Governance: Enforcing data-handling policies as data moves between systems.

Monitoring and Logging

ADF provides a Monitoring tab where you can:

  • View pipeline runs and activity status (a query sketch follows this list)

  • Drill into activity inputs, outputs, and error messages

  • Set up alerts via Azure Monitor
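
The same run history can be pulled programmatically, which is handy for custom dashboards or scripts. A minimal sketch that lists pipeline runs from the last 24 hours, reusing the placeholder names from the walkthrough.

from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

rg_name, df_name = "demo-rg", "demo-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Query all pipeline runs updated in the last day
filters = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc))

runs = adf_client.pipeline_runs.query_by_factory(rg_name, df_name, filters)
for r in runs.value:
    print(r.pipeline_name, r.status, r.run_start, r.message)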

Using an Azure Logic App, you can send an email when a pipeline completes. This is done by calling the Logic App's HTTP endpoint from a Web Activity.
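
A minimal sketch of that notification pattern: a Web Activity that POSTs to a (hypothetical) Logic App HTTP trigger URL and only runs after the Copy Activity from Step 4 succeeds. The Logic App itself would be built separately to turn that request into an email.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, AzureSqlSink,
    WebActivity, ActivityDependency)

rg_name, df_name = "demo-rg", "demo-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

copy = CopyActivity(
    name="CopyCsvToSql",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputCsvDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SalesTableDS")],
    source=BlobSource(), sink=AzureSqlSink())

# Hypothetical Logic App HTTP trigger URL that sends the email
notify = WebActivity(
    name="NotifyOnSuccess", method="POST",
    url="https://<your-logic-app-http-trigger-url>",
    body={"pipeline": "CopyCsvPipeline", "status": "Succeeded"},
    depends_on=[ActivityDependency(activity="CopyCsvToSql",
                                   dependency_conditions=["Succeeded"])])

adf_client.pipelines.create_or_update(
    rg_name, df_name, "CopyCsvPipeline",
    PipelineResource(activities=[copy, notify]))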

That’s it for the day! Thanks for reading. ADF is easy to learn and is in good demand in the market; learning it is highly useful, especially if you are a Data Warehouse Engineer.

Check out this link to know more about me

Let’s get to know each other! https://lnkd.in/gdBxZC5j

Get my books, podcasts, placement preparation, etc. https://linktr.ee/aamirp

Get my Podcasts on Spotify https://lnkd.in/gG7km8G5

Catch me on Medium https://lnkd.in/gi-mAPxH

Follow me on Instagram https://lnkd.in/gkf3KPDQ

Udemy (Python Course) https://lnkd.in/grkbfz_N

YouTube https://www.youtube.com/@knowledge_engine_from_AamirP

Subscribe to my Channel for more useful content.
