1.3 CLASS-DW.pptx: The ETL Process in Detail
1. ETL: The process of loading data into a data warehouse
ETL process operations:
Data extraction
Dirty data
Data transformation
Data cleaning steps
Loading dimensions & facts
Type-1, 2, 3 dimensional changes
Main loading activities
Metadata & its needs
ETL tools and testing
2. ETL Process
The data in a data warehouse system is loaded with an ETL (Extract, Transform, Load) tool.
Three operations −
Extracts the data from your transactional system, which can be Oracle or any other relational database.
Transforms the data by performing data cleansing operations.
Loads the data into the OLAP data warehouse.
An ETL tool can also extract data from flat files such as spreadsheets and CSV files and load them into the DW for data analysis and reporting.
3. Difference between ETL and BI Tools
An ETL tool is used to extract data from different data
sources, transform the data, and load it into a DW system.
The most common ETL tools include − SAP BO Data Services
BODSBODS, Informatica – Power Center, Microsoft – SSIS,
Oracle Data Integrator ODI, Talend Open Studio, Clover ETL
Open source, etc.
BI tool is used to generate reports for end-users, dashboard
for senior management, data visualizations for monthly,
quarterly, and annual board meetings.
Some popular BI tools include − SAP Business Objects, SAP
Lumira, IBM Cognos, JasperSoft, Microsoft BI Platform,
Tableau, Oracle Business Intelligence Enterprise Edition, etc.
3
4. 1. Data Extraction
Data is extracted from heterogeneous data sources.
Each data source has distinct characteristics that need to be managed and integrated into the ETL system in order to extract data effectively.
The analysis of the data source systems is divided into two phases:
Data discovery phase − a key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it; identify and examine the data sources against the target needs.
Anomaly detection phase −
NULL values: an unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they appear in foreign key columns; joining two or more tables on a column that contains NULL values will cause data loss (see the sketch below).
Dates in non-date fields: dates are peculiar elements because they are the only logical elements that can come in various formats. Most DW systems support many of these formats for display purposes but store dates in a single standard format.
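As a minimal illustration of the NULL foreign-key risk, the hypothetical sketch below (plain Python, with made-up table data) shows how an inner-join style lookup silently drops rows whose foreign key is NULL/None:

```python
# Hypothetical sketch: inner join of orders to customers on customer_id.
# Rows whose foreign key is None (NULL) are silently lost.

customers = {101: "Alice", 102: "Bob"}          # customer_id -> name
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 250.0},
    {"order_id": 2, "customer_id": None, "amount": 99.0},   # unhandled NULL
    {"order_id": 3, "customer_id": 102, "amount": 40.0},
]

# Inner-join semantics: keep only orders with a matching customer.
joined = [
    {**o, "customer_name": customers[o["customer_id"]]}
    for o in orders
    if o["customer_id"] in customers
]

print(len(orders), "orders extracted,", len(joined), "orders after the join")
# 3 orders extracted, 2 orders after the join -> order 2 was lost
```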
5. Data Extraction
Data extraction is often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated metadata).
Sometimes source data is copied to the target database using replication capabilities (not recommended because of "dirty data" in the source systems).
Increasingly, extraction is performed by specialized ETL software.
6. Reasons for “Dirty” Data
Dummy Values
Absence of Data
Multipurpose Fields
Puzzling Data
Contradicting Data
Violation of Business Rules
Reused Primary Keys
Non-Unique Identifiers
Integrity Problems
7. 2. Data Transformation (Cleaning)
The main step where the ETL process adds value.
It actually changes data and provides guidance on whether the data can be used for its intended purposes.
Performed in the staging area.
Data quality paradigm:
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at two places − after extraction, and after cleaning and conforming (see the sketch below).
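As an illustration only, here is a minimal Python sketch of the kind of quality checks that might run after extraction and again after cleaning; the field names and rules are assumptions, not part of any specific tool:

```python
# Hypothetical data-quality checks run after extraction and after cleaning.
# Field names and rules are illustrative assumptions, not a real standard.

def quality_report(rows, required_fields=("customer_id", "order_date", "amount")):
    report = {"row_count": len(rows), "missing_values": 0, "duplicates": 0}
    seen = set()
    for row in rows:
        # Completeness: every required field must be present and non-empty.
        if any(row.get(f) in (None, "") for f in required_fields):
            report["missing_values"] += 1
        # Uniqueness: flag exact duplicate rows.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report

extracted = [
    {"customer_id": 101, "order_date": "2024-01-05", "amount": 250.0},
    {"customer_id": None, "order_date": "2024-01-06", "amount": 99.0},
    {"customer_id": 101, "order_date": "2024-01-05", "amount": 250.0},
]
print(quality_report(extracted))
# {'row_count': 3, 'missing_values': 1, 'duplicates': 1}
```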
8. Data Cleansing
The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database; it refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data.
Source systems contain "dirty data" that must be cleaned.
ETL software contains basic data cleansing capabilities.
Specialized data cleansing software is often used.
Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric).
9. Steps in Data Cleansing
1. Parsing
2. Correcting
3. Standardizing
4. Matching
5. Consolidating
10. 1. Parsing:
Parsing locates and identifies individual data elements in the source files and then separates these data elements in the target files.
Examples: parsing the first, middle, and last name; street number and street name; and city and state.
2. Correcting:
Corrects parsed data components using sophisticated data algorithms and secondary data sources.
Example: modifying an address and adding a ZIP code.
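To make the parsing step concrete, here is a small illustrative Python sketch; the name and address formats are assumptions, and real cleansing tools use far more sophisticated rules and reference data:

```python
import re

# Hypothetical parsing of a free-form name and street address into elements.
def parse_name(full_name):
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1]) or None
    return {"first": first, "middle": middle, "last": last}

def parse_street(street):
    # Assumes a leading house number followed by the street name.
    match = re.match(r"\s*(\d+)\s+(.*)", street)
    if match:
        return {"street_number": match.group(1), "street_name": match.group(2)}
    return {"street_number": None, "street_name": street.strip()}

print(parse_name("John Fitzgerald Kennedy"))
# {'first': 'John', 'middle': 'Fitzgerald', 'last': 'Kennedy'}
print(parse_street("1600 Pennsylvania Avenue NW"))
# {'street_number': '1600', 'street_name': 'Pennsylvania Avenue NW'}
```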
11. 3. Standardizing:
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Examples: adding a pre-name (title prefix), replacing a nickname with the formal name.
4. Matching:
Searching and matching records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplicates.
Examples: identifying similar names and addresses.
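A purely illustrative sketch of standardizing and matching; the nickname table and the similarity threshold are assumptions, not part of any specific tool:

```python
from difflib import SequenceMatcher

# Hypothetical standardization rule: replace common nicknames with formal names.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}

def standardize_name(name):
    return NICKNAMES.get(name.strip().lower(), name.strip().title())

def is_probable_match(a, b, threshold=0.85):
    # Fuzzy match on standardized names; the 0.85 threshold is arbitrary.
    return SequenceMatcher(None, standardize_name(a), standardize_name(b)).ratio() >= threshold

print(standardize_name("bob"))                               # Robert
print(is_probable_match("Robert Smith", "robert smith"))     # True (same name, different case)
print(is_probable_match("Robert Smith", "Alice Jones"))      # False
```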
12. 5. Consolidating:
Analyzing and identifying relationships between matched records and merging them into ONE representation.
What is Data Staging?
A temporary step between data extraction and the later steps.
Accumulates data from asynchronous sources using native interfaces, FTP sessions, or other processes.
At a predefined cutoff time, the data in the staging file is transformed and loaded into the warehouse.
There is usually no end-user access to the staging file.
13. Data Transformation Strategy:
Transforms data in accordance with the business rules and standards that have been established.
Examples: format changes, de-duplication, splitting up fields, derived values (see the sketch below).
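A minimal Python sketch of these transformation types, using invented field names, date formats, and rules for illustration only:

```python
from datetime import datetime

# Hypothetical transformations: format change, derived value, de-duplication.
def transform(rows):
    out, seen = [], set()
    for row in rows:
        # Format change: normalize the order date to ISO 8601 (assumes DD/MM/YYYY input).
        iso_date = datetime.strptime(row["order_date"], "%d/%m/%Y").date().isoformat()
        # Derived value: line total computed from quantity and unit price.
        total = row["quantity"] * row["unit_price"]
        record = {"order_id": row["order_id"], "order_date": iso_date, "total": total}
        # De-duplication on the business key.
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            out.append(record)
    return out

raw = [
    {"order_id": 1, "order_date": "05/01/2024", "quantity": 2, "unit_price": 9.5},
    {"order_id": 1, "order_date": "05/01/2024", "quantity": 2, "unit_price": 9.5},  # duplicate
]
print(transform(raw))   # one row, with order_date '2024-01-05' and total 19.0
```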
Data Loading:
Data is physically moved to the data warehouse.
The loading takes place within a "load window".
The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications (loading only changed data rather than doing a bulk reload).
14. 3. Data Loading (Load Dimensions & Facts)
1. Loading Dimensions:
Dimensions are physically built as minimal sets of components.
A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or natural key.
De-normalized flat tables − all attributes in a dimension must take on a single value in the presence of the dimension primary key.
Steps:
1. Managing slowly changing dimensions.
2. Creating the surrogate key.
3. Loading the dimensions with the appropriate structure, primary keys, natural keys, and descriptive attributes (see the sketch below).
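As an illustration of step 2, the hypothetical sketch below assigns surrogate keys to natural keys while loading a small product dimension; the attribute names are invented:

```python
import itertools

# Hypothetical dimension load: assign an integer surrogate key per natural key.
surrogate_counter = itertools.count(start=1)
key_map = {}          # natural key -> surrogate key
product_dim = []      # the dimension table being built

def load_dimension_row(natural_key, attributes):
    if natural_key not in key_map:
        key_map[natural_key] = next(surrogate_counter)
        product_dim.append({"product_key": key_map[natural_key],
                            "product_natural_key": natural_key,
                            **attributes})
    return key_map[natural_key]

load_dimension_row("SKU-001", {"name": "Sparkling Water", "package_type": "Glass"})
load_dimension_row("SKU-002", {"name": "Orange Juice", "package_type": "Carton"})
print(product_dim)    # two rows with surrogate keys 1 and 2
```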
15. 3. Data Loading (Load Dimensions & Facts)
The data loading module consists of all the steps required to monitor slowly changing dimensions (SCD) and write the dimension to disk as a physical table.
16. 3. Data Loading (Load Dimensions & Facts)
1. Loading Dimensions:
When the DW receives notification that an existing row in a dimension has changed, it can respond in one of three ways −
Type 1, Type 2, or Type 3 slowly changing dimension (SCD) handling.
Type-1 Dimensional Changes:
Example: a product's package type changes from Glass to Plastic; a Type-1 change simply overwrites the old value in place (see the sketch below).
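A small illustrative Python sketch of the Type 1 (overwrite) and Type 2 (add a new row with validity dates) responses; the dimension structure and dates are assumptions, and Type 3 (keeping a previous-value column) is omitted for brevity:

```python
from datetime import date

# Start with one product dimension row (surrogate key 1, package type Glass).
product_dim = [{"product_key": 1, "product_natural_key": "SKU-001",
                "package_type": "Glass", "valid_from": date(2023, 1, 1),
                "valid_to": None, "is_current": True}]

def apply_type1(natural_key, attribute, new_value):
    # Type 1: overwrite the attribute in place; history is lost.
    for row in product_dim:
        if row["product_natural_key"] == natural_key and row["is_current"]:
            row[attribute] = new_value

def apply_type2(natural_key, attribute, new_value, change_date):
    # Type 2: expire the current row and add a new one; history is preserved.
    for row in product_dim:
        if row["product_natural_key"] == natural_key and row["is_current"]:
            row["valid_to"], row["is_current"] = change_date, False
            new_row = {**row,
                       "product_key": max(r["product_key"] for r in product_dim) + 1,
                       attribute: new_value, "valid_from": change_date,
                       "valid_to": None, "is_current": True}
            product_dim.append(new_row)
            break

apply_type2("SKU-001", "package_type", "Plastic", date(2024, 6, 1))
print(product_dim)   # two rows: the expired Glass row and the current Plastic row
```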
18. 3. Data Loading (Load Dimensions & Facts)
2. Loading Facts:
Fact tables hold the measurements of an enterprise.
The relationship between fact tables and measurements is extremely simple: if a measurement exists, it can be modeled as a fact table row; if a fact table row exists, it is a measurement.
19. 3. Data Loading (Load Dimensions & Facts)
Activities During Loading Facts:
1. Managing Indexes
Analyze performance killers at load time.
Drop all indexes before the load.
Separate the updates from the inserts.
Load the updates.
Rebuild the indexes.
2. Managing Partitions
Partitions allow a table (and its indexes) to be physically divided into mini-tables, both for administrative purposes and to improve query performance.
The most common partitioning strategy on fact tables is to partition the table by the date key, because the date dimension is preloaded and static (see the sketch below).
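Purely as an illustration of date-key partitioning (the key format and partition naming are assumptions), here is a sketch that buckets fact rows into monthly partitions:

```python
from collections import defaultdict

# Hypothetical fact rows keyed by a YYYYMMDD integer date key.
fact_rows = [
    {"date_key": 20240105, "product_key": 1, "sales_amount": 250.0},
    {"date_key": 20240117, "product_key": 2, "sales_amount": 99.0},
    {"date_key": 20240203, "product_key": 1, "sales_amount": 40.0},
]

# Partition by year and month derived from the date key (one "mini table" per month).
partitions = defaultdict(list)
for row in fact_rows:
    partitions[str(row["date_key"])[:6]].append(row)   # e.g. '202401'

for name, rows in sorted(partitions.items()):
    print(f"partition_{name}: {len(rows)} row(s)")
# partition_202401: 2 row(s)
# partition_202402: 1 row(s)
```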
20. 3. Data Loading (Load Dimensions & Facts)
Q: Why does a DW not need a rollback log?
All data is entered by a managed process − the ETL system.
Data is loaded in bulk.
Data can easily be reloaded if a load process fails.
Each DW system has different logging features and manages its rollback log differently.
21. Metadata & Its Needs
Metadata is data about data.
It is needed by both IT personnel and users.
IT personnel need to know data sources and targets; database, table, and column names; data usage measures; etc.
Users need to know dimensional entity/attribute definitions, the available reports/query tools, report delivery information, etc.
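As a tiny illustration only (all field names invented), an ETL metadata record might capture source-to-target lineage like this:

```python
# Hypothetical metadata record describing one loaded column (illustrative only).
column_metadata = {
    "target_table": "sales_fact",
    "target_column": "sales_amount",
    "source_system": "orders_oltp",
    "source_table": "order_lines",
    "source_column": "line_total",
    "transformation": "currency converted to USD; NULLs replaced with 0",
    "business_definition": "Extended sales amount for the order line",
    "last_load_time": "2024-06-01T02:15:00",
}
print(column_metadata["target_table"], "<-", column_metadata["source_table"])
```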
22. ETL Tool Vendors (Tool - Vendor)
Oracle Warehouse Builder (OWB) - Oracle
Data Integrator (BODI) - Business Objects
IBM Information Server (Ascential) - IBM
SAS Data Integration Studio - SAS Institute
PowerCenter - Informatica
Oracle Data Integrator (Sunopsis) - Oracle
Data Migrator - Information Builders
Integration Services - Microsoft
Talend Open Studio - Talend
DataFlow - Group 1 Software (Sagent)
Data Integrator - Pervasive
Transformation Server - DataMirror
Transformation Manager - ETL Solutions Ltd.
Data Manager - Cognos
DT/Studio - Embarcadero Technologies
ETL4ALL - IKAN
DB2 Warehouse Edition - IBM
Jitterbit - Jitterbit
23. Function of an ETL Tool
Staging Layer − The staging layer, or staging database, is used to store the data extracted from the different source data systems.
Data Integration Layer − The integration layer transforms the data from the staging layer and moves it to a database, where the data is arranged into hierarchical groups (dimensions, facts, and aggregate facts). The combination of fact and dimension tables in a DW system is called a schema.
Access Layer − The access layer is used by end users to retrieve the data for analytical reporting and information (see the sketch below).
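A deliberately simplified sketch of how the three layers might hand data to one another; the function names and data are invented for illustration:

```python
# Illustrative three-layer flow: staging -> integration -> access.
def staging_layer(sources):
    # Store raw extracts from each source, unchanged.
    return [row for source in sources for row in source]

def integration_layer(staged_rows):
    # Transform and arrange rows into a (very small) star schema.
    dimensions = {r["product"]: {"product_key": i + 1, "product": r["product"]}
                  for i, r in enumerate(staged_rows)}
    facts = [{"product_key": dimensions[r["product"]]["product_key"],
              "amount": r["amount"]} for r in staged_rows]
    return {"dim_product": list(dimensions.values()), "fact_sales": facts}

def access_layer(schema):
    # End users query the integrated data, e.g. total sales.
    return sum(f["amount"] for f in schema["fact_sales"])

sources = [[{"product": "Juice", "amount": 10.0}], [{"product": "Water", "amount": 4.0}]]
print(access_layer(integration_layer(staging_layer(sources))))   # 14.0
```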
24. ETL Testing
ETL testing is normally performed on data in a data warehouse system, whereas database testing is commonly performed on transactional systems, where the data comes from different applications into the transactional database.
ETL testing involves the following operations −
1. Validation of data movement from the source to the target system.
2. Verification of the data count in the source and the target system.
3. Verifying that data extraction and transformation are as per requirement and expectation.
4. Verifying that table relations − joins and keys − are preserved during the transformation.
Common ETL testing tools include QuerySurge and Informatica.
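To illustrate operations 1 and 2, a minimal hand-rolled check might look like the sketch below; the row structures are assumptions, and dedicated tools such as QuerySurge automate this kind of comparison:

```python
# Hypothetical source-to-target validation: row counts and a data spot check.
source_rows = [{"order_id": 1, "amount": 250.0}, {"order_id": 2, "amount": 99.0}]
target_rows = [{"order_id": 1, "amount": 250.0}, {"order_id": 2, "amount": 99.0}]

def validate_counts(source, target):
    assert len(source) == len(target), f"count mismatch: {len(source)} vs {len(target)}"

def validate_data(source, target, key="order_id"):
    target_by_key = {row[key]: row for row in target}
    for row in source:
        assert row == target_by_key.get(row[key]), f"data mismatch for {key}={row[key]}"

validate_counts(source_rows, target_rows)
validate_data(source_rows, target_rows)
print("source-to-target count and data checks passed")
```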
25. Categories of ETL Testing
Source to Target Count Testing − It involves matching the count of records in the source and the target systems.
Source to Target Data Testing − It involves data validation between the source and the target systems. It also involves data integration, threshold value checks, and duplicate data checks in the target system.
Data Mapping or Transformation Testing − It confirms the mapping of objects in the source and the target systems. It also involves checking the functionality of the data in the target system.
26. End-User Testing − It involves generating reports for end users to verify that the data in the reports is as expected. It involves finding deviations in the reports and cross-checking the data in the target system for report validation.
Retesting − It involves fixing the bugs and defects in the data in the target system and running the reports again for data validation.
System Integration Testing − It involves testing all the individual systems and later combining the results to find whether there are any deviations. Three approaches can be used to perform this: top-down, bottom-up, and hybrid.
27. ETL Operation (Example)
Clickstream Data Analysis
• Results from clicks at web sites.
• A dialog manager handles user interactions. An ODS (operational data store in the data staging area) helps to custom-tailor the dialog.
• The clickstream data is filtered and parsed and sent to a data warehouse, where it is analyzed.
• Analytic software is available to analyze the clickstream data.
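As an illustration only, the sketch below parses a simplified (invented) clickstream log format and filters out bot traffic before the data would be sent on to the warehouse:

```python
# Hypothetical clickstream line format: "timestamp|user_id|url|user_agent"
raw_clicks = [
    "2024-06-01T10:00:01|u42|/products/juice|Mozilla/5.0",
    "2024-06-01T10:00:02|u99|/robots.txt|Googlebot/2.1",
]

def parse_click(line):
    timestamp, user_id, url, user_agent = line.split("|")
    return {"timestamp": timestamp, "user_id": user_id, "url": url, "user_agent": user_agent}

def is_human(click):
    # Crude bot filter for illustration; real filtering is far more involved.
    return "bot" not in click["user_agent"].lower()

clean_clicks = [c for c in map(parse_click, raw_clicks) if is_human(c)]
print(clean_clicks)   # only the first click survives the filter
```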
28. The Future of Data Warehousing and Analytics
https://mariadb.com/kb/en/
Building an ETL pipeline from scratch in 30 minutes
https://www.youtube.com/watch?v=hjwKKgWbMF0
Easy ETL Program with Python
https://www.youtube.com/watch?v=7O9bosBS8WM
For research:
International Journal of Data Warehousing and Mining
#3: A data dashboard is an information management tool that visually tracks, analyzes, and displays key performance indicators (KPIs), metrics, and key data points to monitor the health of a business, department, or specific process.
#13: 2. Data Transformation; Data Loading starts here.
#17: We have already discussed the details of slowly changing dimensions in previous modules.
#27: The ODS is used to support the web site dialog (an operational process), while the data in the warehouse is analyzed to better understand customers and their use of the web site.