Part 2: Orchestrating the Pipeline – Airflow Integration & Anomaly Flagging
In Part 1, we built the foundation of a data quality monitoring system using Great Expectations (GE) integrated with PySpark. We defined validation rules and applied them to incoming raw data to detect anomalies and ensure only clean, trustworthy records reached downstream systems.
Now, it’s time to orchestrate this validation process using Apache Airflow, a powerful workflow orchestration tool that lets us schedule, monitor, and manage complex pipelines with ease.
In this part, we will walk through building an automated data pipeline that:
Ingests raw data
Executes Spark jobs integrated with GE
Evaluates validation results
Takes conditional action based on the validation outcome: writes clean data downstream, quarantines bad records, and logs anomalies
Let’s dive in.
What We'll Cover
Step 1: Setting up Apache Airflow
If you’re working locally, follow these steps to install and configure Airflow in a virtual environment:
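The exact commands didn’t make it into this post, so here is a typical local setup as a sketch. The Airflow version pin is an assumption — pick a current 2.x release — and the constraints file is the officially recommended way to avoid dependency conflicts:

```shell
# Create and activate an isolated environment
python3 -m venv airflow-venv
source airflow-venv/bin/activate

# Install Airflow pinned against its official constraints file
AIRFLOW_VERSION=2.9.2
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

# Initialize the metadata database and start the webserver + scheduler
# in one process (prints the admin password on first run)
airflow standalone
```

If you prefer separate long-running processes, run `airflow db migrate`, `airflow webserver`, and `airflow scheduler` individually instead of `airflow standalone`.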
Navigate to http://localhost:8080 and log in with your admin credentials.
Airflow uses Directed Acyclic Graphs (DAGs) to define workflows. Each DAG is composed of tasks, which can be Python scripts, Bash commands, Spark jobs, or API calls.
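As a minimal illustration of this idea, here is a toy DAG mixing a Python task and a Bash command. Everything in it (names, schedule) is made up for the example, and it assumes Airflow 2.4+ (older releases spell the schedule argument `schedule_interval`):

```python
# hello_dag.py -- a minimal illustrative DAG; drop it in your dags/ folder
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def say_hello():
    print("hello from a Python task")

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # no schedule; trigger manually from the UI
    catchup=False,
) as dag:
    hello = PythonOperator(task_id="hello", python_callable=say_hello)
    print_date = BashOperator(task_id="print_date", bash_command="date")

    hello >> print_date  # hello must finish before print_date starts
```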
Step 2: Designing the Pipeline Logic
We want to build a DAG that handles the following:
Ingests raw data using Spark
Validates the data using GE expectations suite
Branches based on validation results: clean data is written to its destination, while bad records are quarantined and the anomalies logged
Let’s sketch the structure:
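The sketch itself was omitted here, so this is the shape we’re after (task names are illustrative and mirror the job scripts listed later):

```
ingest_data ──► run_validation ──► branch_on_result
                                      ├─ pass ─► write_clean_data
                                      └─ fail ─► store_bad_records ─► log_anomalies
```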
Code Walkthrough – Building the DAG
Here’s the full DAG, broken into modular sections for better readability and scalability.
spark_validation_pipeline.py
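The DAG file itself wasn’t reproduced above, so here is one way it could look. Treat this as a sketch: the `spark-submit` commands, the `/tmp/validation_result.json` handoff file written by the validation job, the schedule, and the task names are all assumptions, and it targets Airflow 2.4+ (`schedule`, `EmptyOperator`):

```python
# spark_validation_pipeline.py -- illustrative sketch of the orchestration DAG.
# Paths, commands, and the result-file handoff are assumptions; adapt to your project.
import json
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

RESULT_PATH = "/tmp/validation_result.json"  # written by the validation job
JOBS_DIR = "my_project/spark_jobs"

def choose_branch() -> str:
    """Route the run based on the Great Expectations outcome."""
    with open(RESULT_PATH) as f:
        result = json.load(f)
    return "write_clean_data" if result.get("success") else "store_bad_records"

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="spark_validation_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older releases use schedule_interval
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="ingest_data",
        bash_command=f"spark-submit {JOBS_DIR}/ingest_data.py",
    )
    validate = BashOperator(
        task_id="run_validation",
        bash_command=f"spark-submit {JOBS_DIR}/run_validation_with_ge.py",
    )
    branch = BranchPythonOperator(
        task_id="branch_on_result",
        python_callable=choose_branch,
    )
    write_clean = BashOperator(
        task_id="write_clean_data",
        bash_command=f"spark-submit {JOBS_DIR}/write_clean_data.py",
    )
    quarantine = BashOperator(
        task_id="store_bad_records",
        bash_command=f"spark-submit {JOBS_DIR}/store_bad_records.py",
    )
    log_anomalies = BashOperator(
        task_id="log_anomalies",
        bash_command=f"python {JOBS_DIR}/log_anomalies.py",
    )
    # Rejoin the branches; runs as long as one branch succeeded
    done = EmptyOperator(task_id="done", trigger_rule="none_failed_min_one_success")

    ingest >> validate >> branch
    branch >> write_clean >> done
    branch >> quarantine >> log_anomalies >> done
```

The `BranchPythonOperator` returns the `task_id` to follow; Airflow skips the other branch, and the final task’s `none_failed_min_one_success` trigger rule lets the DAG finish cleanly down either path.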
Supporting Code: Modular Task Scripts
Inside your my_project/spark_jobs/ directory:
ingest_data.py
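The ingestion script’s contents weren’t shown, so here is a minimal sketch. The CSV source under `/data/raw/events/` and the Parquet staging area are made-up paths:

```python
# ingest_data.py -- illustrative sketch; source and destination paths are assumptions
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("ingest_data").getOrCreate()

    # Read the raw drop zone; schema inference is fine for a sketch,
    # but pin an explicit schema in production
    raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/data/raw/events/")
    )

    # Land the data in a staging area for the validation step to pick up
    raw.write.mode("overwrite").parquet("/data/staging/events/")
    spark.stop()

if __name__ == "__main__":
    main()
```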
run_validation_with_ge.py
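Again a sketch rather than the original file: it assumes the legacy `SparkDFDataset` wrapper style used with Spark-native GE suites (newer GE releases use a different API), two placeholder expectations, and a `/tmp/validation_result.json` handoff file for the DAG’s branch task to read:

```python
# run_validation_with_ge.py -- illustrative sketch; expectations and paths are placeholders
import json

from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset  # legacy-style API

RESULT_PATH = "/tmp/validation_result.json"  # read later by the DAG's branching task

def main():
    spark = SparkSession.builder.appName("run_validation_with_ge").getOrCreate()
    df = spark.read.parquet("/data/staging/events/")

    ge_df = SparkDFDataset(df)
    # Placeholder expectations -- swap in the suite you built in Part 1
    ge_df.expect_column_values_to_not_be_null("id")
    ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

    result = ge_df.validate()
    summary = result.to_json_dict() if hasattr(result, "to_json_dict") else dict(result)
    failed = [
        r["expectation_config"]["expectation_type"]
        for r in summary.get("results", [])
        if not r.get("success")
    ]

    # Persist a compact summary for the branching and logging tasks
    with open(RESULT_PATH, "w") as f:
        json.dump({"success": summary.get("success", False),
                   "failed_expectations": failed}, f)
    spark.stop()

if __name__ == "__main__":
    main()
```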
write_clean_data.py
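A sketch of the happy path, assuming the staged Parquet data and a curated destination path (both invented here): since this task only runs when validation passed, it simply promotes the staged dataset.

```python
# write_clean_data.py -- illustrative sketch; paths are assumptions
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("write_clean_data").getOrCreate()

    # Validation passed, so promote the staged data to the curated zone
    clean = spark.read.parquet("/data/staging/events/")
    clean.write.mode("overwrite").parquet("/data/curated/events/")
    spark.stop()

if __name__ == "__main__":
    main()
```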
store_bad_records.py
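One way to sketch the quarantine step: GE reports failures at the dataset level, so to isolate the offending rows we re-apply the same row-level rules as filters. The filters below mirror the placeholder expectations, and the quarantine path is an assumption:

```python
# store_bad_records.py -- illustrative sketch; filters mirror the example
# expectations and the quarantine path is an assumption
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main():
    spark = SparkSession.builder.appName("store_bad_records").getOrCreate()
    df = spark.read.parquet("/data/staging/events/")

    # Re-apply the row-level rules to isolate offending records
    bad = df.filter(
        F.col("id").isNull()
        | (F.col("amount") < 0)
        | (F.col("amount") > 1_000_000)
    )

    # Land offending rows in the quarantine zone for later inspection
    bad.write.mode("append").parquet("/data/quarantine/events/")
    spark.stop()

if __name__ == "__main__":
    main()
```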
log_anomalies.py
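The anomaly log doesn’t need Spark at all, so this sketch is plain Python: it reads the validation summary and appends one record per run to a JSON-lines log. The `/tmp/validation_result.json` handoff file and the log location are assumptions:

```python
# log_anomalies.py -- illustrative sketch; file locations are assumptions
import json
from datetime import datetime, timezone
from pathlib import Path

RESULT_PATH = Path("/tmp/validation_result.json")     # written by the validation task
ANOMALY_LOG = Path("/tmp/quality/anomaly_log.jsonl")  # consumed by downstream analysis

def build_log_entry(result: dict, run_ts: str) -> dict:
    """Flatten a validation summary into a single log record."""
    return {
        "run_ts": run_ts,
        "success": result.get("success", False),
        "failed_expectations": result.get("failed_expectations", []),
    }

def main():
    if not RESULT_PATH.exists():
        print(f"no validation result at {RESULT_PATH}; nothing to log")
        return
    result = json.loads(RESULT_PATH.read_text())
    entry = build_log_entry(result, datetime.now(timezone.utc).isoformat())
    ANOMALY_LOG.parent.mkdir(parents=True, exist_ok=True)
    with ANOMALY_LOG.open("a") as f:   # append-only: one JSON line per run
        f.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    main()
```

An append-only JSON-lines file keeps each run’s outcome queryable later, which is exactly what the metric-tracking work in Part 3 will need.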
Deliverables
By the end of this section, we’ve accomplished:
| Task | Status |
| --- | --- |
| Built a DAG to orchestrate the full pipeline | ✅ |
| Integrated Spark with Great Expectations validation | ✅ |
| Implemented conditional branching in Airflow | ✅ |
| Stored bad records in quarantine zone | ✅ |
| Logged anomalies for downstream analysis | ✅ |
📌 Final Thoughts
This approach ensures data quality gates are automated before any data reaches the final destination. Airflow adds modularity, observability, and recovery capabilities to the pipeline. You can now monitor task status, retry failures, and even trigger alerts, which we’ll explore next.
🔮 Coming Next: Part 3 – Self-Healing Pipelines
In the final part of this series, we’ll explore:
Slack & email alerts for failed validations
Auto-retries and backoff strategies
Circuit breaker patterns and pipeline rollback logic
Tracking validation metrics over time
🔔 Stay tuned for: 📘 Part 3: Self-Healing Pipelines – Slack Alerts, Retries & Recovery Logic