Fabric Data Engineer

Hi all!

Currently, as a part of my learning journey, I’m planning to take up the DP-700 certification. This will help you and me with our career growth. Let me share the things that I learnt in this particular certification practice.

Lakehouses:-

This merges data lake storage flexibility with warehouse analytics. The foundation of this Fabric Data Engineering is lakehouse, which is built on top of the OneLake scalable storage layer and uses Apache Spark and SQL compute engines for big data processing.

With lakeshouse, you can query and analyse data.
It is flexible and scalable in terms of storage.

Using this Fabric, we can handle both structured and unstructured data.

A lakehouse is like a database built on top of a data lake using Delta format tables. It can scale automatically.

Data Lake:- Scalable, distributed file storage. Compatible with Big Data.

Data Warehouse:- SQL-based querying, relational schema modelling. A base for reporting.

Lakehouse uses Spark and SQL engines to process large-scale data and supports ML.
Supports ACID
It is a single location for data engineers, scientists, etc.
You can define the schema.

ETL is used to ingest and transform the data.

So, whenever you create a lakehouse, you have three different items automatically in your workspace.

The lakehouse contains shortcuts, files, tables, etc.
Semantic Model (default) provides an easy data source for Power BI report developers.
SQL analytics endpoint allows read-only access to query data.

There are 2 models:

You can add and interact with tables, files, etc.
SQL analytics endpoint enables you to use SQL.

To ingest data,

Upload local files
Dataflows Gen2
Notebooks: Apache
Data Factory Pipelines

Source data permissions and credentials are all managed by OneLake.

Microsoft Fabric Lakehouse is an analytical store that combines the file storage flexibility of a data lake with the SQL-based query capabilities of a warehouse. Create a notebook to use Apache Spark. This belongs to SAAS.

Continuous Integration and Continuous Delivery (CI/CD)

This is done using Git integration and deployment pipelines. You can collaborate with your team.

CI

It helps you to avoid bugs and code failures. So, we frequently commit code to a shared code branch. Developers work in separate code branches on their local machines for long periods and only merge their changes to the main codebase once their work is over. Prime example for this is GitHub.

CD

Process of automatically releasing updates into production environments through stages.

How do you use this CI/CD in Fabric?

Integration is done using Git. Deployment is done using Fabric pipelines. Automation is done using REST APIs.

So, pipelines enable a CI/CD approach that ensures the content is updated, tested, etc.

How to create a pipeline?

There are 2 methods.

Use a workspace towards your left pane.
Create deployment pipeline icon at the top of the workspace.

To create a deployment pipeline:-

Select the Workspaces icon on the left navigation pane, then Deployment pipelines.
Select New Pipeline. Name the pipeline and give the next.
Define and name the stages, then give Create.
Assign a workspace to the stage.

REST API stands for Representational State Transfer Application Programming Interface, which is used to manage and interact with various Azure services.

The advantages of using Fabric REST APIs are to automate repetitive processes and to have seamless integration with other systems and applications.

Deployment pipelines automate the movement of content through development, test and production stages. Fabric workspace will sync content between Git and the workspace. Git lets your team collaborate using branches and provides version control.

Monitor activities

This helps you to gain visibility of your data systems. This collects and aggregates data. The Microsoft Facric Activator helps you to take actions when patterns or conditions are detected in streaming data.

Let’s say monitoring is the process of collecting system data and metrics that determine if a system is healthy and operative. In Fabric, you can schedule activities and jobs that perform tasks like data movement and transformation. Data delays are noted, and end-to-end processes are managed.

Data Pipeline Activity: ETL activities take place.
Dataflows: Tool for ingesting, loading and transforming data.
Semantic Model Refreshes: Visual representation that contains transformations, calculations, etc.
Spark jobs, notebooks and lakehouses: Notebooks are an interface for developing Apache Spark jobs. Data gets loaded or transformed using Spark and notebooks.
Eventstreams: These are observations.

Activator is a technology that enables automated processing of events that trigger actions. Eg: Notifications via email post a completion of a process.

Events: Each record in a stream of data.
Objects: Sales order, or like some business entity.
Rules: Automate actions based on events
Properties: Field measured by environment. Eg: sales, temperature, etc.

Ingest Data with Dataflows Gen2

Dataflows Gen2 are used to ingest and transform data from multiple sources and send them to another destination. Reusabilty is possible here.

Dataflows are like cloud-based ETL tool. Dataflows Gen2 allow you to extract data from various sources, transform it and load into a destination. Using Power Query allows for visual interface.

Generally, in a pipeline you copy the data, then use the preferred language and load the data. In Dataflow Gen2, you extract, transform the data and load it into a lakehouse. This is a curated semantic layer.

Another method is ELT. For this, use a pipeline to extract and load the data into your preferred destination, such as a lakehouse. Dataflows are partitioned horizontally.

Pipelines provide a visual way to complete tasks in a specific order. They are scheduled to run your dataflow.

Dataflow Gen2 is a way to import and transform data with Power Query Online. This is created using Data Factory as our workload. Primarily used for data integration.

Apache Spark

This core technology for large-scale analytics supports Spark Clusters, enabling you to analyse and process data in Lakehouse. Due to its parallel processing and distributed data processing framework, Apache Spark is widely used in Big Data processing. Spark uses a “divide and conquer” approach to process data quickly. Spark is written in Java, Scala, etc. Pyspark and Spark SQL combination is most suited for Data Engineering. Notebooks are used.

Node Family:-

Virtual machines used for Spark Cluster Nodes.

Autoscale:-

Automatically provision nodes or not based on the number of nodes.

Dynamic Allocation:-

Based on data volumes whether to allocate executor processes on the worker nodes. This decision is made here.

Native Execution Engine:-

This is a vectorised processing engine that runs Spark directly on Lakehouse. Using this improves the performance of queries.

MLFlow Logging:-

It is an open-source library that is used in data science to manage ML training and model deployment.

Spark uses a data structure called RDD (resilient distributed dataset). “Magic” tells the Spark what language to use. Partitioning is an optimisation technique. Spark catalog is a metastore for relational objects like views and tables. Spark runtime can use catalog to seamlessly integrate code written in any Spark-supported language with SQL expressions.

Secure Data Access

Microsoft Fabric has a multi-layer security model with access controls at different levels.

Fabric has three security levels:-

1. Microsoft Entra ID authentication:-

Checks if the user can authenticate to the Azure identity.

2. Fabric access:-

Checks if the user can access Fabric.

3. Data Security:-

Checks if the user can perform the requested action on a table.

A workspace in a Fabric lets you distribute ownership and access policies using workspace roles. Within this workspace, you can create Fabric data items like lakehouses, warehouses and semantic models. Fabric data items store data in OneLake. Access to data in a lakehouse is done using RBAC.

Workspaces are environments where users can collaborate to create groups.

There are four workspace roles, namely:- Admin, Member, Contributor, and Viewer.

Data in a lakehouse can be read through the SQL analytics endpoint. These are autogenerated SQL analytics that can be used to transition between the lake view of the lakehouse and the SQL view of the lakehouse. Lake view supports data engineering and Apache Spark, and the SQL view of the same lakehouse allows you to create views, functions, etc.

Eventstreams

Eventstreams with RTI, i.e. Real-Time Intelligence, enable you to ingest and transform real-time data. This will capture and process the data without any coding. You can use multiple source connectors such as Azure Event Hubs, Azure storage, etc. You can send data to a wide range of destinations for further analysis. Eventstreams work by creating pipelines. Eventstream Visual Editor is used to design pipelines.

Components of an Eventstream are:- Sources, Transformations, and Destinations.

Basically, an eventstream will capture real-time data from a source and load data into a destination. Examples of eventstream sources are Azure IoT Hub, Azure Event Hub, etc. Examples of destinations are Lakehouse, Eventhouse, etc.

We use Eventstreams for ingesting and transforming real-time data. Temporal window transformations enable you to aggregate event data in a stream based on specific time periods. Data sent to an Activator destination can be used to trigger an automated action based on the data value. This is the main purpose of the Fabric Activator destination.

Delta Lake Tables

Tables in Lakehouse are based on the Delta Lake storage format, which is commonly used in Apache Spark. These are Linux-based. Delta Lake is a storage layer for Spark that enables a relational database for batch and streaming data. By using this, you can implement a lakehouse architecture to support SQL-based semantics in Spark. It also supports ACID transactions. You organise data into partitions.

OptimizeWrite is a feature of Delta Lake which reduces the number of files written. V-Order is something optional that enables lightning-fast reads. “VACCUM” enables you to remove old data files.

A Spark Structured Streaming dataframe can read data from many kinds of streaming sources, including:

Network ports
Real-time message brokering services such as Azure Event Hubs, Kafka, etc.
File system locations

Delta Lake is a relational storage layer for Spark that supports tables based on Parquet files. Delta is a format you should use to write the dataframe to storage. If you drop a table, its metadata and data files are deleted.

Medallion Architecture

Bronze, Silver and Gold are the layers of a lakehouse for optimised analytics. With this architecture, you can organise the lakehouse logically. Bronze stands for raw, Silver stands for validated, and Gold stands for enriched, each representing the quality levels. This is also called “multi-hop” architecture. You can consider this a data cleaning framework. Depending on the organisation’s needs, we can build more layers. Data orchestration uses pipelines as its primary tool.

We ingest raw data in Bronze and use tools like Pipelines, notebooks, etc. We clean and validate data in the Silver layer and use tools like Notebooks or Dataflows. Transformations and modelling will require SQL.

Data Analysts connect to the semantic model using Direct Lake Mode.

Orchestrate Process

These are Data Factory capabilities, i.e. creating pipelines that orchestrate data ingestion and transformation tasks. As discussed, pipelines are primarily used for ETL. Activities are executable tasks in a pipeline. Use parameters to increase the reusability of pipelines. “Copy Data” activity is the prime use of a data pipeline that is used to ingest data from an external source. You can combine multiple activities in a pipeline.

Data Warehouses

These are analytical stores built on a relational schema to support SQL queries. They are built on a Lakehouse. With a Data Warehouse, you can ingest, store, process, and deliver data. Tables in a data warehouse are organised, and so this is referred to as dimensional modelling.

Fact tables contain numerical data. Example: Amount paid.

Dimension tables contain descriptive information about data in the fact table. Example: Information about customers who paid.

A dimension table can have two key columns, namely:-

Surrogate key is a unique identifier for each row in the dimension table. It’s often an integer automatically generated.

Alternate key is a natural key to specify an instance. Eg:- Customer ID

Time dimensions provide information about the time period in which an event occurred.

Slowly changing dimensions are dimension tables that track changes to attributes.

In most transactional databases, data is normalised to reduce duplication. If the fact table is directly related to the dimension table, it is called a star schema. If there are many levels of information shared by different things, then we use a snowflake schema.

A semantic model defines relationships between different tables, rules like how data is aggregated and summarised, and calculations that are used to derive insights from the data.

You can build relationships that allow you to connect tables in a semantic model. Measures are metrics that you want to analyse in your warehouse. These are calculated fields that are based on data in tables where we use Data Analysis Expressions (DAX) formula language.

In Snowflake schema, the component that contains normalised data is the dimension tables. To ensure data stays synchronised between the warehouse and the lakehouse, use cross-database querying to access the Lakehouse directly. First, ingest data, then load it into the staging tables, load dimension tables, load the fact tables and lastly optimise. A central fact table surrounded by denormalised dimension tables represents a star schema. To monitor connection, session, etc., we use sys.dm_exec_requests.

Load Data into Warehouse

A Data Warehouse refers to storing, organising, and managing large volumes of data. In this Fabric, it is powered with Synapse Analytics. The data is stored in Parquet file format. ‘One Lake’ is a combination of a warehouse and a lakehouse.

SCD

Please refer to the article below for this topic.

https://aamir1610-67719.medium.com/scd-data-warehousing-aamir-p-fafe68c3d91b?source=post_page-----c9ff5298da94---------------------------------------

Data pipeline is a cloud-based service for data integration, which enables the creation of workflows for data movement and transformation. You can schedule and create pipelines that can ingest and load data. The ‘Copy’ statement is used to import data into the warehouse. To connect to an Azure account, you need to use either a Shared Access Signature (SAS) or a Storage Account Key (SAK). Dataflow Gen2 is the new generation of dataflow. Copilot is used for assisting with dataflow transformations.

Real-Time Intelligence

Batch Data Analytics, in which data is loaded into an analytical store at periodic intervals. This is implemented using warehouse or lakehouse.

Real-time data analytics, in which data from events is ingested in real-time as it occurs. This is a kind of lambda architecture. This is commonly based on the ingestion and processing of a data stream that contains a perpetual series of data. Eg: Weather prediction website that gets data from internet-connected weather sensor.

With this Real-Time Intelligence, you can:-

Create an eventstream to capture, transform, and ingest data from various sources.
Store captured real-time data in an eventhouse, which includes one or more KQL databases.
Visualise real-time data using Power BI
Configure alerts that use Activator
Query and analyse using KQL queries.

In real-time hub, you can:-

Find and connect sources
Create real-time dashboards
Endorse and share real-time data
Subscribe to Fabric and Azure events
Share across your organisation

Data Sources for eventstreams

External services, like Azure Storage, Azure Event Hubs, etc., are in relational database services.

Fabric events, such as changes to the Fabric workspace, OneLake data stores, etc., are associated with Fabric jobs.

Sample data, a range of samples that explore real-time analytics.

KQL(Kusto Query Language) is used to write queries in Azure Data Explorer, Fabric, etc. It is a read-only request to process data and return results. This is a subset of T-SQL.

Activator works on four concepts

Events: Each record in a stream of data represents an event that has occurred at a specific point in time.

Objects: Data in an event record can be used to represent an object such as sales order or some business entity.

Properties: The field in the event data can be mapped to properties of a business object, representing an aspect of its state. For example, the total_amount field represents a sales order total.

Rules: Activator automates actions based on events. Example: Defining a rule that sends an email to a maintenance manager.

Real-Time Data in Eventhouse

An eventhouse provides a data store for large volumes of data. It is optimised for data that represents time-based events. An eventhouse contains one or more KQL databases, in which you can create tables, stored procedures, views, etc.

KQL

Filter Columns use the project keyword.

Filter Rows use the where keyword.

getmonth & getyear functions are used to restrict results to a concern time period.

ingestion_time function filters based on data loaded in the table.

summarise is used to summarise the results.

So, that’s it for the day! Hope you found the article useful.

Check out this link to know more about me

Let’s get to know each other! https://lnkd.in/gdBxZC5j

Get my books, podcasts, placement preparation, etc. https://linktr.ee/aamirp

Get my Podcasts on Spotify https://lnkd.in/gG7km8G5

Catch me on Medium https://lnkd.in/gi-mAPxH

Follow me on Instagram https://lnkd.in/gkf3KPDQ

Udemy Udemy (Python Course) https://lnkd.in/grkbfz_N

YouTube https://www.youtube.com/@knowledge_engine_from_AamirP

Subscribe to my Channel for more useful content.

Fabric Data Engineer — Aamir P

AAMIR P

Senior Software Engineer at Tiger Analytics

Lakehouses:-

Continuous Integration and Continuous Delivery (CI/CD)

Monitor activities

Ingest Data with Dataflows Gen2

Apache Spark

Secure Data Access

Eventstreams

Delta Lake Tables

Medallion Architecture

Orchestrate Process

Data Warehouses

Load Data into Warehouse

Real-Time Intelligence

Real-Time Data in Eventhouse

Dive Into Data with Aamir P

1,632 follower

More articles by this author

Others also viewed

Your Comprehensive Guide to Becoming a Data Engineer in 2024

Building Data Pipelines with No-Code ETL Using AWS Glue Studio

The Most Important Concept in Data Engineering

Data Quality Checks and Validation Framework in Azure Pipelines: A Practical Guide for Azure Data Engineers

Azure Data Engineering Cheat Sheet

Understanding Data Pipelines and Key Data Tools

How Apache Airflow Works: An Insight for Modern Data Teams

What Skills Should Every Data Engineer Have in 2025? 🚀

End-to-End Data Pipeline with Snowflake, Airflow, and dbt

Building a Medallion Architecture with EMR Serverless and Apache Iceberg: An Incremental Data Processing Guide with Hands-On Code

Explore topics

Lakehouses:-

Continuous Integration and Continuous Delivery (CI/CD)

Monitor activities

Ingest Data with Dataflows Gen2

Apache Spark

Secure Data Access

Eventstreams

Delta Lake Tables

Medallion Architecture

Orchestrate Process

Data Warehouses

Load Data into Warehouse

Real-Time Intelligence

Real-Time Data in Eventhouse

Dive Into Data with Aamir P

1,632 follower

Azure Data Factory — Aamir P

Jun 3, 2025

CPG (Consumer Packed Goods)— Aamir P

Feb 12, 2025

Dataiku — Aamir P

Oct 11, 2024

PySpark — Aamir P

Oct 3, 2024

Data Build Tool(DBT) — Aamir P

Sep 19, 2024

SSIS Data Warehouse Developer — Aamir P

Sep 10, 2024

Talend — Aamir P

Aug 7, 2024

Data Warehousing and BI Analytics — Aamir P

May 7, 2024

TensorFlow - Aamir P

Apr 24, 2024

Data Engineering — Aamir P

Mar 29, 2024

Others also viewed

Your Comprehensive Guide to Becoming a Data Engineer in 2024

Building Data Pipelines with No-Code ETL Using AWS Glue Studio

The Most Important Concept in Data Engineering

Data Quality Checks and Validation Framework in Azure Pipelines: A Practical Guide for Azure Data Engineers

Azure Data Engineering Cheat Sheet

Understanding Data Pipelines and Key Data Tools

How Apache Airflow Works: An Insight for Modern Data Teams

What Skills Should Every Data Engineer Have in 2025? 🚀

End-to-End Data Pipeline with Snowflake, Airflow, and dbt

Building a Medallion Architecture with EMR Serverless and Apache Iceberg: An Incremental Data Processing Guide with Hands-On Code

Explore topics