ETL: The Process of Loading Data into a Data Warehouse
 ETL process operations
 Data extraction
 Dirty data
 Data transformation
 Data cleaning steps
 Loading dimensions & facts
 Type-1, 2, 3 dimensional changes
 Main loading activities
 Metadata & its needs
 ETL tools -- testing
ETL Process
 The data in a data warehouse system is loaded with an ETL (Extract, Transform, Load) tool.
 Three operations −
 Extract the data from your transactional system, which can be Oracle or any other relational database.
 Transform the data by performing data-cleansing operations.
 Load the data into the OLAP data warehouse.
 ETL tools can also extract data from flat files such as spreadsheets and CSV files and load it into the DW for data analysis and reporting.
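As a concrete illustration of the three operations, here is a minimal Python sketch that extracts rows from a CSV file, applies a simple cleansing transformation, and loads the result into a SQLite table. The file name (sales.csv), column names, and table name are assumptions made for the example, not part of the original slides.

```python
# Minimal ETL sketch (illustrative names: sales.csv, fact_sales).
import csv
import sqlite3
from datetime import datetime

def extract(path):
    # Extract: read raw rows from a flat file (CSV).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: standardize the date format, coerce amounts, skip dirty rows.
    for row in rows:
        try:
            row["sale_date"] = datetime.strptime(row["sale_date"], "%d/%m/%Y").date().isoformat()
            row["amount"] = float(row["amount"])
            yield row
        except (KeyError, ValueError):
            continue

def load(rows, conn):
    # Load: write the cleansed rows into the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (sale_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO fact_sales (sale_date, amount) VALUES (:sale_date, :amount)", rows
    )
    conn.commit()

if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract("sales.csv")), conn)
```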
Difference between ETL and BI Tools
 An ETL tool is used to extract data from different data sources, transform the data, and load it into a DW system.
 The most common ETL tools include − SAP BO Data Services (BODS), Informatica PowerCenter, Microsoft SSIS, Oracle Data Integrator (ODI), Talend Open Studio, CloverETL (open source), etc.
 A BI tool is used to generate reports for end-users, dashboards for senior management, and data visualizations for monthly, quarterly, and annual board meetings.
 Some popular BI tools include − SAP BusinessObjects, SAP Lumira, IBM Cognos, JasperSoft, Microsoft BI Platform, Tableau, Oracle Business Intelligence Enterprise Edition, etc.
1. Data Extraction
 Data is extracted from heterogeneous data sources.
 Each data source has distinct characteristics that need to be managed and integrated into the ETL system in order to extract data effectively.
The analysis of the data source systems is divided into two phases:
Data discovery phase: a key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it. Working from the target's needs, identify and examine the data sources.
Anomaly detection phase:
 NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns: joining two or more tables on a column that contains NULL values will cause data loss!
 Dates in non-date fields. Dates are peculiar elements because they are the only logical elements that can come in many different formats. Most DW systems support the various formats for display purposes but store dates in a single standard format.
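Both anomaly classes above can be checked mechanically during data discovery. The sketch below, with hypothetical table and column names, counts NULLs in candidate foreign-key columns and flags values in a date column that do not parse into the warehouse's standard format; it assumes the extracted data has already been landed in a SQLite staging table.

```python
# Anomaly-detection sketch over a staged table (names are illustrative).
import sqlite3
from datetime import datetime

def profile_anomalies(conn, table, fk_columns, date_column):
    report = {}
    # NULLs in foreign key columns: a join on these columns would drop rows.
    for col in fk_columns:
        nulls = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
        report[f"nulls_in_{col}"] = nulls
    # Dates that do not match the single standard storage format (ISO here).
    bad_dates = 0
    for (value,) in conn.execute(f"SELECT {date_column} FROM {table}"):
        try:
            datetime.strptime(str(value), "%Y-%m-%d")
        except ValueError:
            bad_dates += 1
    report[f"unparseable_{date_column}"] = bad_dates
    return report

# Example call (assumed staging table and columns):
# profile_anomalies(conn, "stg_orders", ["customer_id", "product_id"], "order_date")
```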
Data Extraction
 Data extraction is often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated metadata).
 Sometimes source data is copied to the target database using replication capabilities (not recommended because of “dirty data” in the source systems).
 Increasingly, extraction is performed by specialized ETL software.
Reasons for “Dirty” Data
 Dummy Values
 Absence of Data
 Multipurpose Fields
 Puzzling Data
 Contradicting Data
 Violation of Business Rules
 Reused Primary Keys
 Non-Unique Identifiers
 Integrity Problems
2. Data Transformation (Cleaning)
 The main step where the ETL adds value.
 Actually changes the data and provides guidance on whether the data can be used for its intended purposes.
 Performed in the staging area.
Data Quality paradigm
 Correct
 Unambiguous
 Consistent
 Complete
 Data quality checks are run at two places: after extraction, and after cleaning and conforming.
Data Cleansing
 The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database; it refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data.
 Source systems contain “dirty data” that must be cleaned.
 ETL software contains basic data-cleansing capabilities.
 Specialized data-cleansing software is often used.
 Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric).
Steps in Data Cleansing
1. Parsing
2. Correcting
3. Standardizing
4. Matching
5. Consolidating
1. Parsing:
 Parsing locates and identifies individual data elements in the source files and then separates these data elements into the target files.
 Examples: parsing the first, middle, and last name; street number and street name; city and state.
2. Correcting:
 Corrects parsed data components using sophisticated data algorithms and secondary data sources.
 Example: modifying an address and adding a ZIP code.
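A rough idea of what parsing looks like in code, assuming simple "first middle last" and "number street" layouts in the source; real cleansing tools handle far messier inputs.

```python
# Parsing sketch: split free-text fields into individual data elements.
def parse_name(full_name):
    parts = full_name.split()
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]) or None,
        "last": parts[-1],
    }

def parse_street(street_line):
    number, _, name = street_line.partition(" ")
    return {"street_number": number, "street_name": name}

print(parse_name("John Q Public"))        # {'first': 'John', 'middle': 'Q', 'last': 'Public'}
print(parse_street("221B Baker Street"))  # {'street_number': '221B', 'street_name': 'Baker Street'}
```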
3. Standardizing:
 Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
 Examples: adding a pre-name (title), replacing a nickname with the formal name.
4. Matching:
 Searching for and matching records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplication.
 Examples: identifying similar names and addresses.
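A small sketch of standardizing and matching: nicknames are replaced from a lookup table and near-duplicate names are flagged with a similarity score. The nickname table and the 0.9 threshold are arbitrary illustrative choices.

```python
# Standardizing and matching sketch.
from difflib import SequenceMatcher

NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}

def standardize(name):
    # Replace a known nickname with the formal name, otherwise title-case it.
    return NICKNAMES.get(name.strip().lower(), name.strip().title())

def is_match(a, b, threshold=0.9):
    # Flag two values as probable duplicates if they are similar enough.
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio() >= threshold

print(standardize("bob"))                       # Robert
print(is_match("Robert Smith", "Robrt Smith"))  # True (likely the same person)
```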
5. Consolidating:
 Analyzing and identifying relationships between matched records and merging them into ONE representation.
What is Data Staging?
 A temporary step between data extraction and the later steps.
 Accumulates data from asynchronous sources using native interfaces, FTP sessions, or other processes.
 At a predefined cutoff time, the data in the staging file is transformed and loaded into the warehouse.
 There is usually no end-user access to the staging file.
Data Transformation strategy:
 Transforms data in accordance with the business rules and standards that have been established.
 Examples: format changes, de-duplication, splitting up fields, derived values.
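The sketch below illustrates these kinds of rule-based transformations on hypothetical order records: splitting a combined field, computing a derived value, and de-duplicating on an assumed business key.

```python
# Rule-based transformation sketch (field names are illustrative).
def transform_record(rec):
    city, _, state = rec.pop("city_state").partition(", ")  # split up a field
    rec.update(city=city, state=state)
    rec["total"] = rec["quantity"] * rec["unit_price"]      # derived value
    return rec

def deduplicate(records, key="order_id"):
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:        # keep only the first occurrence
            seen.add(rec[key])
            unique.append(rec)
    return unique

rows = [
    {"order_id": 1, "city_state": "Pune, MH", "quantity": 2, "unit_price": 10.0},
    {"order_id": 1, "city_state": "Pune, MH", "quantity": 2, "unit_price": 10.0},
]
print([transform_record(r) for r in deduplicate(rows)])
```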
Data Loading:
 Data are physically moved to the data warehouse.
 The loading takes place within a “load window”.
 The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications (loading only changed data rather than doing a bulk reload).
3. Data Loading (Load Dimensions & Facts)
1. Loading Dimensions:
 Dimensions are physically built from minimal sets of components.
 A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or natural key.
 Dimensions are de-normalized flat tables – all attributes in a dimension must take on a single value in the presence of the dimension's primary key.
Steps:
1. Managing slowly changing dimensions
2. Creating the surrogate key
3. Loading dimensions with the appropriate structure, primary keys, natural keys, and descriptive attributes
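A minimal sketch of what a loaded dimension might look like, run here through Python and SQLite: the surrogate key is generated by the warehouse, the natural key comes from the source, and the descriptive attributes sit de-normalized in the same flat table. All names and values are illustrative.

```python
# Dimension-load sketch: surrogate key, natural key, descriptive attributes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_product (
        product_key        INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        product_code       TEXT NOT NULL,                      -- natural key from the source
        product_name       TEXT,
        package_type       TEXT,
        row_effective_date TEXT,
        row_expiry_date    TEXT                                -- NULL = current row
    )
""")
conn.execute(
    "INSERT INTO dim_product (product_code, product_name, package_type, row_effective_date) "
    "VALUES (?, ?, ?, ?)",
    ("P-100", "Orange Juice", "Glass", "2024-01-01"),
)
print(conn.execute("SELECT * FROM dim_product").fetchall())
```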
3. Data Loading (Load Dimensions & Facts)
The data loading module consists of all the steps required to monitor slowly changing dimensions (SCD) and write each dimension to disk as a physical table.
3. Data Loading (Load Dimensions & Facts)
1. Loading Dimensions:
 When the DW receives notification that an existing row in a dimension has changed, it gives one of three types of response: Type 1, Type 2, or Type 3 slowly changing dimension handling.
Type-1 Dimensional Changes:
Example: a product's package type changes from Glass to Plastic; the existing attribute value is simply overwritten, so no history is kept.
3. Data Loading (Load Dimensions & Facts)
Type-2 Dimensional Changes: a new dimension row (with a new surrogate key) is added for the changed value, preserving full history.
Type-3 Dimensional Changes: an extra column keeps the previous attribute value alongside the current one, preserving limited history.
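Continuing the illustrative dim_product table from the earlier sketch, the snippet below shows how the "Glass to Plastic" change could be handled as a Type-1 overwrite or a Type-2 history-preserving insert; Type 3 would instead keep the previous value in an extra column. This is only a sketch of the idea, not the deck's reference implementation.

```python
# Slowly-changing-dimension sketch (uses the dim_product table above).
from datetime import date

def scd_type1(conn, product_code, new_package):
    # Type 1: overwrite in place; no history is kept.
    conn.execute(
        "UPDATE dim_product SET package_type = ? WHERE product_code = ?",
        (new_package, product_code),
    )

def scd_type2(conn, product_code, new_package):
    # Type 2: expire the current row, then insert a new row that gets
    # its own surrogate key; full history is preserved.
    today = date.today().isoformat()
    conn.execute(
        "UPDATE dim_product SET row_expiry_date = ? "
        "WHERE product_code = ? AND row_expiry_date IS NULL",
        (today, product_code),
    )
    conn.execute(
        "INSERT INTO dim_product (product_code, package_type, row_effective_date) "
        "VALUES (?, ?, ?)",
        (product_code, new_package, today),
    )

# Type 3 (for comparison) would be a single statement against an assumed
# previous_package_type column:
# UPDATE dim_product SET previous_package_type = package_type, package_type = ?
# WHERE product_code = ?
```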
3. Data Loading (Load Dimensions & Facts)
2. Loading Facts:
 Fact tables hold the measurements of an enterprise.
 The relationship between fact tables and measurements is extremely simple.
 If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement.
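A fact-load sketch that continues the same illustrative tables: each incoming measurement is keyed by looking up the current surrogate key in the dimension and is then inserted as one fact row.

```python
# Fact-load sketch: one measurement becomes one fact table row.
def load_sale(conn, product_code, sale_date, quantity, amount):
    # Look up the surrogate key of the current dimension row.
    (product_key,) = conn.execute(
        "SELECT product_key FROM dim_product "
        "WHERE product_code = ? AND row_expiry_date IS NULL",
        (product_code,),
    ).fetchone()
    conn.execute(
        "INSERT INTO fact_sales (product_key, sale_date, quantity, amount) "
        "VALUES (?, ?, ?, ?)",
        (product_key, sale_date, quantity, amount),
    )

# Assumed fact table, for completeness:
# CREATE TABLE fact_sales (product_key INTEGER, sale_date TEXT,
#                          quantity INTEGER, amount REAL)
```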
3. Data Loading (Load Dimensions & Facts)
Activities During Loading Facts:
1. Managing Indexes
 Analyze performance killers at load time
 Drop all indexes before the load
 Separate the updates from the inserts
 Load the updates
 Rebuild the indexes
2. Managing Partitions
 Partitions allow a table (and its indexes) to be physically divided into mini-tables for administrative purposes and to improve query performance.
 The most common partitioning strategy on fact tables is to partition the table by the date key, because the date dimension is preloaded and static.
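A sketch of the index-management pattern described above, on the same illustrative fact table: drop the index before the bulk insert, reload, then rebuild it. Real partition maintenance is database-specific DDL and is only hinted at in the comment.

```python
# Index management during a bulk fact load (names are illustrative).
def bulk_load_facts(conn, rows):
    # 1. Drop indexes before the load so inserts are not slowed down.
    conn.execute("DROP INDEX IF EXISTS idx_fact_sales_date")
    # 2. Bulk-load the new fact rows.
    conn.executemany(
        "INSERT INTO fact_sales (product_key, sale_date, quantity, amount) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    # 3. Rebuild the indexes once the load is complete.
    conn.execute("CREATE INDEX idx_fact_sales_date ON fact_sales (sale_date)")
    conn.commit()
    # Partitioning by the date key would be handled with engine-specific DDL
    # (e.g., range partitions per month on sale_date).
```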
3. Data Loading (Load Dimensions & Facts)
Q: Why does a DW not need a rollback log?
 All data is entered by a managed process—the ETL system.
 Data is loaded in bulk.
 Data can easily be reloaded if a load process fails.
 Each DW system has different logging features and manages its rollback log differently.
Metadata & Its Needs
 Data about data.
 Needed by both IT personnel and users.
 IT personnel need to know data sources and targets; database, table, and column names; data usage measures; etc.
 Users need to know dimensional entity/attribute definitions, the reports/query tools available, report delivery information, etc.
ETL Tool                              Vendor
Oracle Warehouse Builder (OWB)        Oracle
Data Integrator (BODI)                Business Objects
IBM Information Server (Ascential)    IBM
SAS Data Integration Studio           SAS Institute
PowerCenter                           Informatica
Oracle Data Integrator (Sunopsis)     Oracle
Data Migrator                         Information Builders
Integration Services                  Microsoft
Talend Open Studio                    Talend
DataFlow                              Group 1 Software (Sagent)
Data Integrator                       Pervasive
Transformation Server                 DataMirror
Transformation Manager                ETL Solutions Ltd.
Data Manager                          Cognos
DT/Studio                             Embarcadero Technologies
ETL4ALL                               IKAN
DB2 Warehouse Edition                 IBM
Jitterbit                             Jitterbit
Function of an ETL Tool
Staging Layer − The staging layer or staging database is used to store the data extracted from the different source data systems.
Data Integration Layer − The integration layer transforms the data from the staging layer and moves it to a database, where the data is arranged into hierarchical groups (dimensions, facts, and aggregate facts). The combination of fact and dimension tables in a DW system is called a schema.
Access Layer − The access layer is used by end-users to retrieve the data for analytical reporting and information.
ETL Testing
 ETL testing is normally performed on data in a data warehouse system, whereas database testing is commonly performed on transactional systems, where the data comes from different applications into the transactional database.
ETL testing involves the following operations −
1. Validation of data movement from the source to the target system.
2. Verification of the data counts in the source and the target system.
3. Verifying data extraction and transformation as per requirements and expectations.
4. Verifying that table relations – joins and keys – are preserved during the transformation.
 Common ETL testing tools include QuerySurge and Informatica.
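Operation 2 above (count verification) is easy to automate. The sketch below, with assumed connections and table names, is the kind of check a testing tool runs; it could be wired into pytest or a similar framework.

```python
# Source-to-target count test sketch (connections and names are assumed).
def test_row_count(source_conn, target_conn, source_table, target_table):
    src = source_conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = target_conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    assert src == tgt, f"Row count mismatch: source={src}, target={tgt}"

# Example: test_row_count(oltp_conn, dw_conn, "orders", "fact_orders")
```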
Categories of ETL Testing
 Source to Target Count Testing − It involves matching the count of records in the source and the target systems.
 Source to Target Data Testing − It involves data validation between the source and the target systems. It also involves data integration, threshold-value checks, and duplicate-data checks in the target system.
 Data Mapping or Transformation Testing − It confirms the mapping of objects in the source and the target systems. It also involves checking the functionality of the data in the target system.
 End-User Testing − It involves generating reports for end-users to verify that the data in the reports is as expected. It involves finding deviations in reports and cross-checking the data in the target system for report validation.
 Retesting − It involves fixing the bugs and defects in the data in the target system and running the reports again for data validation.
 System Integration Testing − It involves testing all the individual systems and later combining the results to find whether there are any deviations. There are three approaches that can be used to perform this: top-down, bottom-up, and hybrid.
ETL Operation (Example)
Clickstream Data Analysis
• Results from clicks at web sites.
• A dialog manager handles user interactions. An ODS (operational data store in the data staging area) helps to custom-tailor the dialog.
• The clickstream data is filtered, parsed, and sent to a data warehouse, where it is analyzed.
• Analytic software is available to analyze the clickstream data.
THE FUTURE OF DATA WAREHOUSING AND ANALYTICS
https://mariadb.com/kb/en/
Building an ETL pipeline from scratch in 30 minutes
https://www.youtube.com/watch?v=hjwKKgWbMF0
Easy ETL Program with Python
https://www.youtube.com/watch?v=7O9bosBS8WM
For research:
International Journal of Data Warehousing and Mining
ACTIVITY-04 [GROUP]
Investigate the design and operational aspects of an ETL engine for any data warehousing application.
Thanks for your patience!
Queries?
Editor's Notes

  • #3: A data dashboard is an information management tool that visually tracks, analyzes, and displays key performance indicators (KPIs), metrics, and key data points to monitor the health of a business, department, or specific process.
  • #8: 2. Data Transformation
  • #9: 2. Data Transformation
  • #10: 2. Data Transformation
  • #11: 2. Data Transformation
  • #12: 2. Data Transformation
  • #13: 2. Data Transformation (Data Loading starts here)
  • #17: We have already discussed "slowly changing dimensions" in detail in previous modules.
  • #27: The ODS is used to support the web site dialog (an operational process), while the data in the warehouse is analyzed to better understand customers and their use of the web site.