BUSINESS ANALYTICS 813
E T L
By: JESUSA LAO ESPELETA
Extracting
Transforming
Loading
ETL, what is it?
• ETL (Extract-Transform-Load) is a data integration process that combines data
from numerous data sources into a single, consistent store, which is then loaded
into a data warehouse or other destination system.
What does ETL do?
• Migrates data from one database to another
• Converts a database from one format or type to another
• Reads data from an input source
• Performs at least three specific functions (extract, transform, load)
• Enables loading of multiple target databases
• Forms data marts and data warehouses
• Transforms the data to make it usable for business analysis
• Writes the resultant data set back out to a flat file, relational table, etc.
ETL: how does it work in a data warehouse?
EXTRACT
Raw data is copied or exported from source locations to a staging area.

TRANSFORM
In the staging area, the raw data is processed and consolidated for its
intended analytical use case.

LOAD
In this final stage, the transformed data is moved from the staging area
into the target data warehouse.
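The three stages above can be sketched end to end in plain Python. This is a minimal illustration, not the API of any particular ETL tool; the file contents, the `sales` table and its column names are all hypothetical.

```python
import csv
import io
import sqlite3

# --- Extract: copy raw rows out of a source (here, an in-memory CSV file) ---
raw_csv = io.StringIO("id,name,amount\n1, alice ,10.5\n2,BOB,3.25\n")
rows = list(csv.DictReader(raw_csv))

# --- Transform: clean and consolidate for the intended analytical use ---
def transform(row):
    return (int(row["id"]),
            row["name"].strip().title(),   # normalise inconsistent casing/spacing
            round(float(row["amount"]), 2))

staged = [transform(r) for r in rows]

# --- Load: write the transformed rows into the target store ---
conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)

print(conn.execute("SELECT name, amount FROM sales ORDER BY id").fetchall())
# [('Alice', 10.5), ('Bob', 3.25)]
```

Real pipelines replace the in-memory pieces with files, databases, or APIs, but the extract/transform/load shape stays the same.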
ETL Process – Step 1: Capture
Operational System → Enterprise Data Warehouse (Data Reconciliation)

Capture = extract… obtaining a snapshot of a chosen subset of the source data
for loading into the data warehouse.
• Static extract = capturing a snapshot of the source data at a point in time.
• Incremental extract = capturing changes that have occurred since the last
static extract.
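The static/incremental distinction can be shown with a small query sketch. The `orders` table, its `updated_at` column and the timestamps are hypothetical; the point is only that an incremental extract filters on a change marker, while a static extract takes everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, "2024-01-01"),
                  (2, 20.0, "2024-03-15"),
                  (3, 5.0, "2024-03-20")])

# Static extract: a snapshot of the chosen subset at a point in time.
static_rows = conn.execute("SELECT id, amount FROM orders").fetchall()

# Incremental extract: only rows changed since the last extract ran.
last_extract = "2024-03-01"
incremental_rows = conn.execute(
    "SELECT id, amount FROM orders WHERE updated_at > ?",
    (last_extract,)).fetchall()

print(len(static_rows), len(incremental_rows))  # 3 2
```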
ETL Process – Step 2: Scrub
Operational System → Enterprise Data Warehouse (Data Reconciliation)

Scrub = cleanse… uses pattern recognition and AI techniques to upgrade data quality.
• Fixing errors: misspellings, erroneous dates, incorrect field usage,
mismatched addresses, missing data, duplicate data & inconsistencies.
• Also: decoding, reformatting, time stamping, conversion, key generation,
merging, error detection/logging & locating missing data.
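A scrub pass can be as simple as pattern-matching and rule checks. The records below are hypothetical, but they show three of the error classes named above: a duplicate, an inconsistently formatted date, and missing data that gets logged rather than silently loaded.

```python
import re

# Hypothetical raw customer records with typical quality problems.
raw = [
    {"id": 1, "name": "Alice", "joined": "2024-01-05"},
    {"id": 1, "name": "Alice", "joined": "2024-01-05"},   # duplicate record
    {"id": 2, "name": "  bob ", "joined": "05/01/2024"},  # inconsistent date format
    {"id": 3, "name": "", "joined": "2024-02-10"},        # missing name
]

def scrub(record):
    rec = dict(record)
    rec["name"] = rec["name"].strip().title() or None     # normalise; flag missing as None
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", rec["joined"])
    if m:                                                 # reformat DD/MM/YYYY to ISO
        rec["joined"] = f"{m.group(3)}-{m.group(2)}-{m.group(1)}"
    return rec

seen, clean, errors = set(), [], []
for record in raw:
    rec = scrub(record)
    if rec["id"] in seen:                                 # drop duplicate rows
        continue
    seen.add(rec["id"])
    # Error detection/logging: rows still missing data go to an error list.
    (errors if rec["name"] is None else clean).append(rec)

print(len(clean), len(errors))  # 2 1
```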
ETL Process – Step 3: Transform
Operational System → Enterprise Data Warehouse (Data Reconciliation)

Transform = convert data from the format of the operational system to the
format of the data warehouse.

Record-level:
• SELECTION – data partitioning
• JOINING – data combining
• AGGREGATION – data summarization

Field-level:
• SINGLE FIELD – from one field to one field
• MULTI FIELD – from many fields to one, or one field to many
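The three record-level operations map directly onto SQL, which is how most ETL tools express them. The `orders` and `customers` tables below are hypothetical stand-ins for operational-system data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL, region TEXT);
CREATE TABLE customers (id INTEGER, name TEXT);
INSERT INTO orders VALUES (1, 1, 10.0, 'EU'), (2, 1, 20.0, 'EU'), (3, 2, 5.0, 'US');
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
""")

# SELECTION - partition the data (only the EU region).
eu = conn.execute("SELECT id FROM orders WHERE region = 'EU'").fetchall()

# JOINING - combine order records with customer attributes.
joined = conn.execute("""
    SELECT c.name, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.id
""").fetchall()

# AGGREGATION - summarise amounts per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

print(totals)  # [('Alice', 30.0), ('Bob', 5.0)]
```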
ETL Process – Step 4: Load/Index
Operational System → Enterprise Data Warehouse (Data Reconciliation)

Load/Index = place transformed data into the warehouse and create indexes.
• Refresh mode: bulk rewriting of the target data at periodic intervals.
• Update mode: only changes in the source data are written to the data warehouse.

NOTE
• The staging area should be accessed only by the ETL process. It should never be
available to anyone else, particularly not to end users: it is not intended for
presenting data to the end user and may contain incomplete data.
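Refresh mode and update mode can both be expressed against the same target table. The sketch below uses SQLite's upsert syntax for update mode; the `dw_sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE INDEX idx_amount ON dw_sales(amount)")  # index the target

staged = [(1, 10.0), (2, 20.0)]

# Refresh mode: bulk-rewrite the whole target table at a periodic interval.
conn.execute("DELETE FROM dw_sales")
conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", staged)

# Update mode: write only the changed source rows (an upsert).
changes = [(2, 25.0), (3, 7.0)]
conn.executemany(
    "INSERT INTO dw_sales VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
    changes)

print(conn.execute("SELECT * FROM dw_sales ORDER BY id").fetchall())
# [(1, 10.0), (2, 25.0), (3, 7.0)]
```

Refresh is simpler but rewrites everything; update touches less data but requires reliable change capture upstream.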
Types of ETL Testing
• Production Validation Testing – the Data Validation Option provides ETL
testing automation and management capabilities to ensure that production
systems are not compromised by the data.
• Source to Target Testing (Validation Testing) – carried out to validate
whether the transformed data values are the expected data values.
• Application Upgrades – this type of ETL test can be generated automatically,
saving substantial test development time. It checks whether the data extracted
from an older application or repository is exactly the same as the data in the
new repository or application.
• Metadata Testing – includes data type checks, data length checks and
index/constraint checks.
• Data Completeness Testing – verifies that all the expected data is loaded
into the target from the source. Typical tests compare and validate counts,
aggregates and actual data between source and target for columns with simple
or no transformation.
• Data Quality Testing – includes number checks, date checks, precision
checks, data checks, null checks, etc.
• Data Accuracy Testing – ensures that the data is accurately loaded and
transformed as expected.
• GUI/Navigation Testing – checks the navigation or GUI aspects of the
front-end reports.
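A data completeness check of the kind described above compares counts and aggregates between source and target. The `src` and `tgt` tables are hypothetical; a real check would point at the operational source and the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INTEGER, amount REAL);
CREATE TABLE tgt (id INTEGER, amount REAL);
INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 5.0);
INSERT INTO tgt VALUES (1, 10.0), (2, 20.0), (3, 5.0);
""")

def completeness_check(conn, source, target, column):
    """Compare row counts and a simple aggregate between source and target."""
    src_count, src_sum = conn.execute(
        f"SELECT COUNT(*), SUM({column}) FROM {source}").fetchone()
    tgt_count, tgt_sum = conn.execute(
        f"SELECT COUNT(*), SUM({column}) FROM {target}").fetchone()
    return src_count == tgt_count and src_sum == tgt_sum

print(completeness_check(conn, "src", "tgt", "amount"))  # True
```

If a load drops or duplicates rows, the count or the aggregate diverges and the check fails.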
Simplified Data Flow
Data Sources (operational system, customers, communications, SQL, flat files)
→ ETL (data validation, data cleaning, data transforming, data aggregating,
data loading)
→ Data warehouse / data marts
→ BI Results (OLAP analysis, data mining, data visualization, reports,
dashboards, alerts)
→ End user
Difference between ETL Testing and Database Testing

ETL TESTING
• Verifies whether data is moved as expected.
• Verifies whether counts in the source and target match.
• Verifies whether the data is transformed as expected.
• Verifies that the foreign-primary key relations are preserved during the ETL.
• Checks for duplication in the loaded data.

DATABASE TESTING
• The primary goal is to check whether the data follows the rules/standards
defined in the data model.
• Verifies that there are no orphan records and that foreign-primary key
relations are maintained.
• Verifies that there are no redundant tables and that the database is
optimally normalized.
• Verifies whether data is missing in columns where it is required.
ETL tool types, what are they?
ETL tools have been available for more than 30 years. Various types of
solutions have been on the market:
• COMMERCIAL
• OPEN SOURCE-BASED
ETL
Process Strategy Types
• A Push strategy is initiated by the source system. As part of the extraction
process, the source data can be pushed/exported or dumped onto a file location,
from where it can be loaded into a staging area.
- No additional burden on the transactional systems when we want to load data
into the staging database.
- Additional space is required to store the data that needs to be loaded into
the staging database.
• A Pull strategy is initiated by the target system. As part of the extraction
process, the source data can be pulled from the transactional system into a
staging area by establishing a connection to the relational/flat-file/ODBC
sources.
- No additional space is required to store the data that needs to be loaded
into the staging database.
- Places a burden on the transactional systems when we want to load data into
the staging database.
5 Reasons for “DIRTY” Data:
1. FORMATTING ISSUES – If your data formatting isn’t uniform across your whole
set, you’re likely to run into some serious problems when it comes time to
crunch the numbers.
2. DUPLICATE DATA – Duplicate data is the bane of data analysts and data
scientists and can lead to dire consequences if reported on, as numbers will
be over-inflated.
3. IMPROPER INCLUSION – If you’re compiling data in an attempt to calculate
your business’s revenue, you don’t want your expense numbers included in the
calculations.
4. INCOMPLETE DATA – An incomplete data set means that you’re not able to see
the full picture of whatever you’re analyzing.
5. CONTRADICTORY DATA – Less likely to come from transfer or formatting issues
and more likely from data entry: data that has been entered into the wrong
category, with the wrong units, or that comes from the wrong timespan.
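Several of these problems can be detected mechanically before analysis. The records below are hypothetical; the sketch flags duplicates, non-uniform date formats, and incomplete rows.

```python
from collections import Counter

# Hypothetical records showing three of the problems above.
records = [
    {"id": 1, "date": "2024-01-05", "revenue": 100.0},
    {"id": 1, "date": "2024-01-05", "revenue": 100.0},   # duplicate data
    {"id": 2, "date": "05/01/2024", "revenue": 50.0},    # formatting issue
    {"id": 3, "date": "2024-02-01", "revenue": None},    # incomplete data
]

duplicates = [k for k, n in Counter(r["id"] for r in records).items() if n > 1]
bad_format = [r["id"] for r in records if "/" in r["date"]]   # non-ISO dates
incomplete = [r["id"] for r in records if r["revenue"] is None]

print(duplicates, bad_format, incomplete)  # [1] [2] [3]
```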
Why is ETL important?
• Removes mistakes and corrects data.
• Captures the flow of transactional data.
• Adjusts data from multiple sources so it can be used together.
• Enables subsequent business / analytical data processing.
What can ETL be used for?
• To acquire a temporary subset of data (like a VIEW) for reports or other
purposes. A more permanent dataset may be acquired for other purposes, such as
the population of a data mart or data warehouse.
Question:
Since the ETL process provides a mini-data-warehouse component that looks
remarkably like a data mart and performs all the data extraction, filtering,
integration, classification and aggregation functions that the data warehouse
normally provides, why do we need an extra data warehouse as a duplicated part?

In fact, when properly implemented, the data warehouse performs all data
preparation functions instead of letting ETL perform those chores, so there is
no duplication of function. Better yet, the data warehouse handles the data
component much more efficiently than ETL does, so we can appreciate the
benefits of having a central data warehouse serve as the large enterprise
decision support database. Moreover, to provide better performance, ETL can
merge the data warehouse and data mart approaches by storing small extracts of
the data warehouse at end-user workstations.