Fabric Data Engineer — Aamir P
Hi all!
Currently, as part of my learning journey, I’m planning to take up the DP-700 certification. This will help you and me with our career growth. Let me share the things I learnt while practising for this certification.
Lakehouses:-
A lakehouse merges the storage flexibility of a data lake with the analytics capabilities of a warehouse. It is the foundation of Fabric Data Engineering: it is built on top of the OneLake scalable storage layer and uses the Apache Spark and SQL compute engines for big data processing.
With Fabric, we can handle both structured and unstructured data.
A lakehouse is like a database built on top of a data lake using Delta format tables. It can scale automatically.
Data Lake:- Scalable, distributed file storage. Compatible with Big Data.
Data Warehouse:- SQL-based querying, relational schema modelling. A base for reporting.
ETL is used to ingest and transform the data.
So, whenever you create a lakehouse, three items are automatically created in your workspace: the lakehouse itself, a default semantic model, and a SQL analytics endpoint.
You can work with a lakehouse in two modes: the Lakehouse mode for exploring files and tables with the engineering tools, and the SQL analytics endpoint mode for querying the tables with SQL.
To ingest data, you can upload files, use Dataflows Gen2, use notebooks, or use Data Factory pipelines. You can also create shortcuts to data in external storage; source data permissions and credentials are all managed by OneLake.
Microsoft Fabric Lakehouse is an analytical store that combines the file storage flexibility of a data lake with the SQL-based query capabilities of a warehouse. To work with it programmatically, create a notebook and use Apache Spark. Fabric itself is a SaaS (Software as a Service) offering.
Continuous Integration and Continuous Delivery (CI/CD)
This is done using Git integration and deployment pipelines. You can collaborate with your team.
CI
CI helps you avoid bugs and code failures by having developers frequently commit code to a shared branch. Traditionally, developers work in separate branches on their local machines for long periods and only merge their changes into the main codebase once their work is done; CI replaces those big, late merges with small, frequent ones that are built and tested automatically. A prime example of a platform used for this is GitHub.
CD
The process of automatically releasing updates into production environments through a series of stages.
How do you use this CI/CD in Fabric?
Integration is done using Git. Deployment is done using Fabric pipelines. Automation is done using REST APIs.
So, deployment pipelines enable a CI/CD approach that ensures content is kept up to date and tested before it reaches production.
How to create a deployment pipeline?
There are two ways: from the Deployment pipelines page in Fabric, or directly from a workspace. In both cases you name the pipeline, define its stages (such as development, test, and production), assign a workspace to each stage, and then deploy content from one stage to the next.
REST API stands for Representational State Transfer Application Programming Interface. The Fabric REST APIs are used to programmatically manage and interact with Fabric items and other Azure services.
The advantages of using Fabric REST APIs are to automate repetitive processes and to have seamless integration with other systems and applications.
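As a minimal sketch of that automation, the Python snippet below calls the Fabric REST API to list the workspaces you can access. It assumes you have already acquired a Microsoft Entra access token for the Fabric API (for example with MSAL); the token placeholder, the workspace-listing endpoint, and the response fields shown are assumptions for illustration.

```python
import requests

# Assumption: an access token for the Fabric REST API has already been acquired
# (for example via the MSAL library). The placeholder below is illustrative.
ACCESS_TOKEN = "<your-access-token>"

# List the workspaces the caller can access.
response = requests.get(
    "https://api.fabric.microsoft.com/v1/workspaces",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
response.raise_for_status()

for workspace in response.json().get("value", []):
    print(workspace["id"], workspace["displayName"])
```

The same pattern (a bearer token plus an HTTP call) applies to other automation tasks, such as triggering deployments or refreshing items on a schedule.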
Deployment pipelines automate the movement of content through development, test and production stages. Fabric workspace will sync content between Git and the workspace. Git lets your team collaborate using branches and provides version control.
Monitor activities
This helps you gain visibility into your data systems by collecting and aggregating data about them. Microsoft Fabric Activator helps you take actions when patterns or conditions are detected in streaming data.
Let’s say monitoring is the process of collecting system data and metrics that determine if a system is healthy and operative. In Fabric, you can schedule activities and jobs that perform tasks like data movement and transformation. Data delays are noted, and end-to-end processes are managed.
Activator is a technology that enables automated processing of events that trigger actions, e.g. sending an email notification when a process completes.
Ingest Data with Dataflows Gen2
Dataflows Gen2 are used to ingest and transform data from multiple sources and send it to another destination. Reusability is possible here.
Dataflows are like a cloud-based ETL tool. Dataflows Gen2 let you extract data from various sources, transform it, and load it into a destination, using Power Query Online as the visual interface.
Generally, in a pipeline you copy the data, then transform it with your preferred language and load it. With Dataflow Gen2, you extract and transform the data first and then load it into a lakehouse, which can act as a curated semantic layer.
Another pattern is ELT: use a pipeline to extract and load the data into your preferred destination, such as a lakehouse, and transform it there. Dataflows can also be partitioned horizontally.
Pipelines provide a visual way to complete tasks in a specific order. They are scheduled to run your dataflow.
Dataflow Gen2 is a way to import and transform data with Power Query Online. It is created through the Data Factory workload and is primarily used for data integration.
Apache Spark
This core technology for large-scale analytics supports Spark clusters, enabling you to analyse and process data in a lakehouse. Apache Spark is widely used in big data processing because of its parallel, distributed processing framework: it uses a “divide and conquer” approach to process data quickly across many nodes. Spark is written in Java and Scala, among other languages. The combination of PySpark and Spark SQL is best suited for data engineering, and notebooks are the usual way to run it.
Node Family:-
Virtual machines used for Spark Cluster Nodes.
Autoscale:-
Determines whether nodes are automatically added to or removed from the pool as the workload changes.
Dynamic Allocation:-
Determines whether executor processes are dynamically allocated on the worker nodes based on data volumes.
Native Execution Engine:-
This is a vectorised processing engine that runs Spark operations directly against the lakehouse. Enabling it improves query performance.
MLflow Logging:-
MLflow is an open-source library used in data science to manage machine learning training and model deployment.
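As a quick illustration, here is a minimal sketch of logging a parameter and a metric with MLflow from a notebook; the parameter name and metric value are made up for the example.

```python
import mlflow

# Start a tracked run and record an illustrative parameter and metric.
# The names and values below are made up for the example; in a Fabric
# notebook the mlflow library is available in the Spark runtime.
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
```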
Spark uses a data structure called an RDD (resilient distributed dataset). A “magic” command at the top of a notebook cell (such as %%sql or %%pyspark) tells Spark which language to use. Partitioning is an optimisation technique. The Spark catalog is a metastore for relational objects like views and tables, and the Spark runtime can use it to seamlessly integrate code written in any Spark-supported language with SQL expressions.
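To make this concrete, here is a minimal PySpark sketch of the kind of code you would run in a Fabric notebook: load a CSV file into a dataframe, write it as a partitioned Delta table, and query it through the Spark catalog with Spark SQL. The file path, column names, and table name are assumptions for the example.

```python
from pyspark.sql.functions import col, year

# Assumption: this runs in a Fabric notebook attached to a lakehouse,
# where a SparkSession named `spark` is already available.
# The file path and column names below are made up for the example.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("Files/raw/sales.csv")
)

# Derive a partition column and write the data as a Delta table.
df = df.withColumn("OrderYear", year(col("OrderDate")))
df.write.format("delta").mode("overwrite").partitionBy("OrderYear").saveAsTable("sales")

# The table is now registered in the Spark catalog, so it can be queried with SQL
# (in a notebook cell you could also use the %%sql magic instead).
spark.sql("SELECT OrderYear, COUNT(*) AS orders FROM sales GROUP BY OrderYear").show()
```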
Secure Data Access
Microsoft Fabric has a multi-layer security model with access controls at different levels.
Fabric has three security levels:-
1. Microsoft Entra ID authentication:-
Checks whether the user can authenticate to Microsoft Entra ID (Azure’s cloud-based identity service).
2. Fabric access:-
Checks if the user can access Fabric.
3. Data Security:-
Checks if the user can perform the requested action on a table.
A workspace in Fabric lets you distribute ownership and access policies using workspace roles. Within a workspace, you can create Fabric data items like lakehouses, warehouses and semantic models, which store their data in OneLake. Access to data in a lakehouse is managed using role-based access control (RBAC).
Workspaces are environments where groups of users collaborate to create and manage items.
There are four workspace roles, namely:- Admin, Member, Contributor, and Viewer.
Data in a lakehouse can be read through the SQL analytics endpoint. This endpoint is autogenerated and lets you transition between the lake view of the lakehouse and the SQL view of the lakehouse: the lake view supports data engineering with Apache Spark, while the SQL view of the same lakehouse allows you to create views, functions, and more.
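As a rough illustration, the sketch below queries a lakehouse through its SQL analytics endpoint from Python with pyodbc. It assumes the Microsoft ODBC Driver 18 for SQL Server is installed and that you can sign in interactively with Microsoft Entra ID; the server, database, and table names are placeholders for your own lakehouse.

```python
import pyodbc

# Assumptions: ODBC Driver 18 for SQL Server is installed; the server name is
# copied from the lakehouse's SQL analytics endpoint settings; the database
# and table names below are placeholders for the example.
connection = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-sql-analytics-endpoint>;"
    "Database=<your_lakehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

cursor = connection.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.sales")  # read-only query against a lakehouse table
for row in cursor.fetchall():
    print(row)
connection.close()
```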
Eventstreams
Eventstreams in Real-Time Intelligence (RTI) enable you to ingest and transform real-time data, capturing and processing it without any coding. You can use multiple source connectors such as Azure Event Hubs, Azure Storage, etc., and send the data to a wide range of destinations for further analysis. Eventstreams work by creating pipelines, which you design in the Eventstream Visual Editor.
Components of an Eventstream are:- Sources, Transformations, and Destinations.
Basically, an eventstream will capture real-time data from a source and load data into a destination. Examples of eventstream sources are Azure IoT Hub, Azure Event Hub, etc. Examples of destinations are Lakehouse, Eventhouse, etc.
We use Eventstreams for ingesting and transforming real-time data. Temporal window transformations enable you to aggregate event data in a stream based on specific time periods. Data sent to an Activator destination can be used to trigger an automated action based on the data value. This is the main purpose of the Fabric Activator destination.
Delta Lake Tables
Tables in a lakehouse are based on the Delta Lake storage format, which is commonly used in Apache Spark; specifically, they use the Linux Foundation Delta Lake table format. Delta Lake is a storage layer for Spark that enables relational database capabilities for both batch and streaming data. By using it, you can implement a lakehouse architecture that supports SQL-based semantics in Spark. It also supports ACID transactions, and you can organise data into partitions.
OptimizeWrite is a Delta Lake feature that reduces the number of files written. V-Order is an optional optimisation that enables lightning-fast reads. VACUUM lets you remove old data files.
A Spark Structured Streaming dataframe can read data from many kinds of streaming sources, including network ports, real-time message brokering services such as Azure Event Hubs or Kafka, and file system locations.
Delta Lake is a relational storage layer for Spark that supports tables based on Parquet files. Delta is the format you should use to write a dataframe to storage. If you drop a managed table, its metadata and data files are deleted; dropping an external table removes only the metadata.
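For instance, here is a minimal PySpark sketch contrasting a managed and an external Delta table, plus a look at the table history; the file path, table names, and storage locations are assumptions for the example.

```python
# Assumption: running in a Fabric notebook with a lakehouse attached;
# the source file, table names, and external path below are made up for the example.
df = spark.read.format("csv").option("header", "true").load("Files/raw/products.csv")

# Managed table: Spark manages both the metadata and the data files,
# so DROP TABLE removes the data as well.
df.write.format("delta").mode("overwrite").saveAsTable("products_managed")

# External table: the data stays at the path you choose, and DROP TABLE
# removes only the metadata.
(
    df.write.format("delta")
    .mode("overwrite")
    .option("path", "Files/external/products")
    .saveAsTable("products_external")
)

# Delta keeps a transaction log, so you can inspect the table's history.
spark.sql("DESCRIBE HISTORY products_managed").show(truncate=False)
```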
Medallion Architecture
Bronze, Silver and Gold are the layers of a lakehouse for optimised analytics. With this architecture, you can organise the lakehouse logically. Bronze stands for raw, Silver stands for validated, and Gold stands for enriched, each representing the quality levels. This is also called “multi-hop” architecture. You can consider this a data cleaning framework. Depending on the organisation’s needs, we can build more layers. Data orchestration uses pipelines as its primary tool.
We ingest raw data into Bronze using tools like pipelines and notebooks. We clean and validate data in the Silver layer using notebooks or dataflows. Transformations and dimensional modelling for the Gold layer are typically done with SQL.
Data Analysts connect to the semantic model using Direct Lake Mode.
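A minimal sketch of a Bronze-to-Silver step in PySpark might look like the following; the table names, columns, and cleansing rules are assumptions for the example.

```python
from pyspark.sql.functions import col, current_timestamp

# Assumption: a raw `bronze_sales` Delta table already exists in the lakehouse;
# the column names and cleansing rules below are made up for the example.
bronze_df = spark.read.table("bronze_sales")

silver_df = (
    bronze_df
    .dropDuplicates(["OrderID"])                        # remove duplicate orders
    .filter(col("CustomerID").isNotNull())              # drop rows failing a basic validation rule
    .withColumn("LoadTimestamp", current_timestamp())   # add audit metadata
)

# Write the validated data to the Silver layer as a Delta table.
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_sales")
```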
Orchestrate Process
These are Data Factory capabilities, i.e. creating pipelines that orchestrate data ingestion and transformation tasks. As discussed, pipelines are primarily used for ETL. Activities are executable tasks in a pipeline. Use parameters to increase the reusability of pipelines. The Copy Data activity is the most common use of a data pipeline and is used to ingest data from an external source. You can combine multiple activities in a pipeline.
Data Warehouses
These are analytical stores built on a relational schema to support SQL queries. They are built on a lakehouse. With a data warehouse, you can ingest, store, process, and deliver data. Tables in a data warehouse are organised around fact and dimension tables, an approach referred to as dimensional modelling.
Fact tables contain numerical data. Example: Amount paid.
Dimension tables contain descriptive information about data in the fact table. Example: Information about customers who paid.
A dimension table can have two key columns, namely:-
Surrogate key is a unique identifier for each row in the dimension table. It’s often an integer automatically generated.
Alternate key is a natural key to specify an instance. Eg:- Customer ID
Time dimensions provide information about the time period in which an event occurred.
Slowly changing dimensions are dimension tables that track changes to attributes.
In most transactional databases, data is normalised to reduce duplication. If the fact table is directly related to denormalised dimension tables, it is called a star schema. If dimension tables are further normalised into multiple related levels because many entities share the same information, then we use a snowflake schema.
A semantic model defines relationships between different tables, rules like how data is aggregated and summarised, and calculations that are used to derive insights from the data.
You can build relationships that allow you to connect tables in a semantic model. Measures are metrics that you want to analyse in your warehouse. These are calculated fields that are based on data in tables where we use Data Analysis Expressions (DAX) formula language.
In a snowflake schema, the component that contains the normalised data is the dimension tables. To ensure data stays synchronised between the warehouse and the lakehouse, use cross-database querying to access the lakehouse directly. The load order is: ingest the data, load it into staging tables, load the dimension tables, load the fact tables, and finally optimise. A central fact table surrounded by denormalised dimension tables represents a star schema. To monitor connections, sessions and requests, use the dynamic management views sys.dm_exec_connections, sys.dm_exec_sessions and sys.dm_exec_requests.
Load Data into Warehouse
A data warehouse refers to storing, organising, and managing large volumes of data. In Fabric, it is powered by Synapse Analytics, and the data is stored in the Parquet file format. OneLake is the single storage layer shared by both the warehouse and the lakehouse.
SCD (Slowly Changing Dimensions)
Please refer to the article below for this topic.
A data pipeline is a cloud-based service for data integration that lets you create workflows for data movement and transformation. You can create and schedule pipelines that ingest and load data. The COPY statement is used to import data into the warehouse. To connect to an Azure storage account, you need either a Shared Access Signature (SAS) or a Storage Account Key (SAK). Dataflow Gen2 is the new generation of dataflow, and Copilot can assist with dataflow transformations.
Real-Time Intelligence
Batch data analytics, in which data is loaded into an analytical store at periodic intervals. This is implemented using a warehouse or a lakehouse.
Real-time data analytics, in which data from events is ingested as it occurs. Combining batch and real-time processing in this way is often called a lambda architecture. Real-time analytics is commonly based on the ingestion and processing of a data stream that contains a perpetual series of data, e.g. a weather website that gets data from internet-connected weather sensors.
With Real-Time Intelligence, you can ingest, transform, and store real-time data, query and visualise it, and trigger actions when conditions are met.
In the Real-Time hub, you can discover and connect to streaming data sources across your organisation and preview and manage data streams.
Data Sources for eventstreams
External services, such as Azure Storage, Azure Event Hubs, Apache Kafka, and change data capture feeds from relational database services.
Fabric events, such as changes to items in a Fabric workspace, changes to data in OneLake data stores, and events associated with Fabric jobs.
Sample data, a range of sample streams for exploring real-time analytics.
KQL (Kusto Query Language) is used to write queries in Azure Data Explorer, Fabric, and other services. A KQL query is a read-only request to process data and return results. KQL databases can also be queried using a subset of T-SQL.
Activator works on four concepts:-
Events: Each record in a stream of data represents an event that has occurred at a specific point in time.
Objects: Data in an event record can be used to represent an object such as a sales order or some other business entity.
Properties: The field in the event data can be mapped to properties of a business object, representing an aspect of its state. For example, the total_amount field represents a sales order total.
Rules: Activator automates actions based on conditions detected in events. Example: defining a rule that sends an email to a maintenance manager when a sensor reading crosses a threshold.
Real-Time Data in Eventhouse
An eventhouse provides a data store for large volumes of data, optimised for data that represents time-based events. An eventhouse contains one or more KQL databases, in which you can create tables, stored functions, materialised views, and more.
KQL
To select specific columns, use the project operator.
To filter rows, use the where operator.
The getmonth() and getyear() functions are used to restrict results to a specific time period.
The ingestion_time() function filters based on when data was loaded into the table.
summarize (with aggregation functions such as count()) is used to summarise the results.
So, that’s it for the day! Hope you found the article useful.
Check out this link to know more about me
Let’s get to know each other! https://lnkd.in/gdBxZC5j
Get my books, podcasts, placement preparation, etc. https://linktr.ee/aamirp
Get my Podcasts on Spotify https://lnkd.in/gG7km8G5
Catch me on Medium https://lnkd.in/gi-mAPxH
Follow me on Instagram https://lnkd.in/gkf3KPDQ
Udemy (Python Course) https://lnkd.in/grkbfz_N
Subscribe to my Channel for more useful content.