Azure Data Factory: Mapping Data Flows
Performance Tuning Data Flows
Agenda
 Data Lake ETL Performance
 Database ETL Performance
 Transformation optimizations
 Monitoring
 Global Settings
 Best Practices
 Azure Integration Runtimes
Data Lake ETL Performance
Sample Timings 1
Scenario 1
 Source: Delimited Text Blob Store
 Sink: Azure SQL DB
 File size: 421 MB, 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 4 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Current partitioning used throughout
Sample timings 2
 Scenario 2
 Source: Delimited Text Blob Store
 Sink: Delimited Text Blob store
 Table size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 2 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Leaving default/current partitioning throughout allows
ADF to scale partitions up or down based on the size of the Azure IR (i.e. number of
worker cores)
File conversion Source->Sink property findings
 Large data sizes should use more vcores (16+) with memory optimized or
general purpose
 Compute optimized does not improve performance in this scenario
 CSV-to-Parquet format conversion has ~45% time overhead compared with CSV
to CSV
 CSV-to-JSON format conversion has ~24% time overhead compared with CSV to
CSV
 CSV to JSON performs better even though it has more data to write
 CSV to Parquet lags slightly because of time spent on compression
 Scaling vcores improves performance for both I/O and computation
File Partitioning
 Maintain current partitioning
 Avoid output to single file
 For manual partitioning, use number of cores from your Azure IR
and multiply by 5
 Example: transform a series of files in your ADLS folders w/32-core Azure IR, number of
partitions would be 32 x 5 = 160 partitions
 If you know data well enough to have high-cardinality columns, use those columns as Hash
partition
 If you do not know data patterns very well, use Round Robin
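The "cores x 5" rule of thumb above can be sketched as a small helper; a minimal illustration, not an ADF API:

```python
# Rule of thumb from this deck: for manual partitioning, multiply the
# Azure IR worker core count by 5 to get a partition count.
def recommended_partitions(ir_worker_cores: int, factor: int = 5) -> int:
    """Suggested partition count when overriding current partitioning."""
    return ir_worker_cores * factor

# Example from the deck: a 32-core Azure IR -> 32 x 5 = 160 partitions
print(recommended_partitions(32))  # 160
```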
File Conversion Timing
Compute type: General Purpose
• Dataset has 36 Columns of string, integer, short, double
• CSV dataset has 25 files with different file sizes
• Performance improvement scales proportionately with the increase
in vcores
• Scaling from 8 to 64 vcores yields roughly an 8x improvement
Database ETL Performance
Sample timings for Azure SQL DB
 Scenario w/Azure SQL DB
 Source: Azure SQL DB Table
 Sink: Azure SQL DB Table
 Table size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 3 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Source partitioning on SQL DB Source, current partitioning
on Derived Column and Sink
SQL Database Timing
Synapse DW Timing
Compute type: General Purpose
Adding cores proportionally decreases the time it takes to process data into staging files for PolyBase. However, there is a
fairly static amount of time that it takes to write that data from Parquet into SQL tables using PolyBase.
CosmosDB Timing
Compute type: General Purpose
Transformation Performance
Window / Aggregate Timing
Compute type: General Purpose
• Performance improvement scales proportionately with the
increase in vcores
• Scaling from 8 to 64 vcores yields roughly a 5x improvement
Transformation Timings
Compute type: General Purpose
Transformation recommendations
• When ranking data across entire dataset, use Rank
transformation instead of Window with rank()
• When using rowNumber() in Window to uniquely
add a row counter to each row across entire
dataset, instead use the Surrogate Key
transformation
TPCH Timings
Compute type: General Purpose
TPCH CSV in ADLS Gen 2
Optimizing transformations
 Each transformation has its own optimize tab
 Generally better not to alter partitioning -> reshuffling is a relatively slow process
 Reshuffling can occur if data is very skewed
 One node has a disproportionate amount of data
 For Joins, Exists and Lookups:
 If you have many of these transforms, memory optimized greatly increases performance
 Use cached lookup w/cached sink
 Can ‘Broadcast’ if the data on one side is small
 Rule of thumb: Less than 50k rows
 Use Window transformation partitioned over segments of data
 For Rank() across entire dataset, use the Rank transformation instead
 For RowNumber() across entire dataset, use the Surrogate Key transformation instead
 Transformations that require reshuffling like Sort negatively impact
performance
ETL Performance Monitoring
Identifying bottlenecks
1. Cluster startup time
2. Sink processing time
3. Source read time
4. Transformation stage time
1. Sequential executions can
lower the cluster startup time
by setting a TTL in Azure IR
2. Total time to process the
stream from source to sink.
There is also a post-processing
time when you click on the Sink
that will show you how much
time Spark had to spend with
partition and job clean-up.
Writing to a single file and
slow database connections will
increase this time
3. Shows you how long it took to
read data from source.
Optimize with different source
partition strategies
4. This will show you bottlenecks
in your transformation logic.
With larger general purpose
and mem optimized IRs, most
of these operations occur in
memory in data frames and are
usually the fastest operations
in your data flow
Global configurations that affect performance
 Logging level (pipeline activity)
 Verbose (default) is most expensive
 You can get a small increase in performance for large data flows without detailed logging
 Trade-off: Less diagnostics
 Error row handling (sink transformation)
 Expect 5%-10% perf hit
 Trade-off: Provides detailed logging and continuation of data flow on database driver errors
 Run in parallel (pipeline activity)
 Currently only available for “connected” streams, i.e. multiple sinks from a single stream
 Can write to multiple sinks at same time
 Use with new branch, conditional split
 Parallel activity executions (pipeline activity)
 If you place data flow activities on your pipeline canvas without connector lines, your data
flows can all start at the same time, lowering overall pipeline execution times.
ETL Performance Best Practices
Best practices - Sources
 When reading from file-based sources, data flow automatically
partitions the data based on size
 ~128 MB per partition, evenly distributed
 Using current partitioning is fastest for file-based sources and for Synapse using PolyBase
 Enable staging for Synapse
 For Azure SQL DB, use Source partitioning on column with high
cardinality
 Improves performance, but can saturate your source database
 Reading can be limited by the I/O of your source
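The ~128 MB-per-partition behavior above implies a rough partition count for any file source; a back-of-the-envelope sketch (the 128 MB target is from this deck, the function name is illustrative):

```python
import math

PARTITION_TARGET_BYTES = 128 * 1024 * 1024  # ~128 MB per partition

def estimated_file_partitions(total_bytes: int) -> int:
    """Rough count of partitions data flow would create for a file source."""
    return max(1, math.ceil(total_bytes / PARTITION_TARGET_BYTES))

# The 421 MB CSV from Scenario 1 would land in about 4 partitions
print(estimated_file_partitions(421 * 1024 * 1024))  # 4
```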
Best practices – Debug (Data Preview)
 Data Preview
 Data preview is inside the data flow designer transformation properties
 Uses row limits and sampling techniques to preview a small sample of data
 Allows you to build and validate units of logic with samples of data in real time
 You have control over the size of the data limits under Debug Settings
 If you wish to test with larger datasets, set a larger compute size in the Azure IR when
switching on “Debug Mode”
 Data Preview is only a snapshot of data in memory from Spark data frames. This feature does
not write any data, so the sink drivers are not utilized and not tested in this mode.
Best practices – Debug (Pipeline Debug)
 Pipeline Debug
 Click debug button to test your data flow inside of a pipeline
 Default debug limits the execution runtime so you will want to limit data sizes
 Sampling can be applied here as well by using the “Enable Sampling” option in each Source
 Use the debug button option of “use activity IR” when you wish to use a job execution
compute environment
 This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the
default debug setting
Best practices - Sinks
 SQL:
 Disable indexes on target with pre/post SQL scripts
 Increase SQL capacity during pipeline execution
 Enable staging when using Synapse
 Use Source Partitioning on Source under Optimize
 Set number of partitions based on size of IR
 File-based sinks:
 Using current partitioning allows Spark to create the output
 Output to a single file is a slow operation
 Often unnecessary for whoever is consuming the data
 Can set naming patterns or use data in column
 Any reshuffling of data is slow
 Cosmos DB
 Set throughput and batch size to meet performance requirements
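One way to apply the "disable indexes" SQL sink tip above is via the sink's pre/post SQL script fields. A hedged sketch, held here as Python strings; the table and index names (`dbo.FactSales`, `IX_FactSales_Date`) are hypothetical:

```python
# Hypothetical pre/post SQL scripts for a SQL sink: disable a
# nonclustered index before the bulk load, rebuild it afterwards.
# Object names are illustrative only.
PRE_SQL = "ALTER INDEX IX_FactSales_Date ON dbo.FactSales DISABLE;"
POST_SQL = "ALTER INDEX IX_FactSales_Date ON dbo.FactSales REBUILD;"

print(PRE_SQL)
print(POST_SQL)
```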
Azure Integration Runtime Best Practices
 Data Flows use JIT compute to minimize running expensive clusters
when they are mostly idle
 Generally more economical, but each cluster takes ~4 minutes to spin up
 IR specifies what cluster type and core-count to use
 Memory optimized is best, compute optimized doesn’t generally work for production workloads
 When running Sequential jobs utilize Time to Live to reuse cluster
between executions
 Keeps compute resources alive for TTL minutes after execution for new job to use
 Maximum one job per cluster
 Reduces job startup latency to ~1.5 minutes
 Click “Quick reuse” to lower sequential activity start-up times to < 10 seconds
 Rule of thumb: start small and scale up
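The TTL and compute-type choices above live in the Azure IR definition. A sketch of the `dataFlowProperties` portion, expressed as a Python dict mirroring the REST/ARM shape (field names assumed from the public API; verify against current docs):

```python
# Sketch of the dataFlowProperties section of a managed Azure IR
# definition. timeToLive keeps the Spark cluster warm between
# sequential data flow runs, cutting startup latency.
data_flow_properties = {
    "computeType": "MemoryOptimized",  # General | ComputeOptimized | MemoryOptimized
    "coreCount": 16,                   # worker vcores; start small and scale up
    "timeToLive": 10,                  # minutes to keep compute alive after a run
}

print(data_flow_properties)
```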
Azure IR – General Purpose
• This was General Purpose 4+4, the default auto resolve Azure IR
• For prod workloads, GP is usually sufficient at >= 16 cores
• You get 1 driver and 1 worker node, both with 4 vcores
• Good for debugging, testing, and many production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 4 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 46s
• Transformation time: 42s
• Sink post-processing time: 45s
Azure IR – Compute Optimized
• Compute Optimized is intended for smaller workloads
• 8+8, this is smallest CO option and you get 1 driver and 2
workers
• Not suitable for large production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 8 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 20s
• Transformation time: 35s
• Sink post-processing time: 40s
• More worker nodes gave us more partitions and better perf than
General Purpose
Azure IR – Memory Optimized
• Memory Optimized is well suited for reliable large production
workloads with many aggregates, lookups, and joins
• 64+16 gives you 16 vcores for driver and 64 across worker nodes
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 64 partitions
• Cluster startup time: 4.8 mins
• Sink IO writing: 19s
• Transformation time: 17s
• Sink post-processing time: 40s
Resources
 Complete Data Flows Performance Tuning and Profiles Deck
 https://www2.slideshare.net/kromerm/azure-data-factory-data-flow-performance-tuning-101
 Data Flows Training
 https://www2.slideshare.net/kromerm/azure-data-factory-data-flows-training-sept-2020-update
 Data Flows Video Tutorials
 https://docs.microsoft.com/en-us/azure/data-factory/data-flow-tutorials
 Data Flows Performance Home Page
 https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance
 Copy Data Performance Guidance
 https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
