Er. Nawaraj Bhandari
Data Warehouse/Data Mining
Chapter 4:
Data Warehouse Back End Tools
Data warehousing systems use various data extraction and cleaning tools, and
load and refresh utilities for populating data warehouses.
• Data Cleaning
• Load
• Refresh
Data Cleaning
A data warehouse integrates large volumes of data from multiple sources, so the
probability of errors and anomalies in the data is high.
Examples include inconsistent field lengths, inconsistent descriptions,
inconsistent value assignments, missing entries, and violations of integrity
constraints.
Three classes of data cleaning tools are commonly used to detect data
anomalies and correct them:
• Data migration tools allow simple transformation rules to be specified.
• Data scrubbing tools use domain-specific knowledge to scrub the data.
Such tools help to maintain the integrity of an entity.
• Data auditing tools make it possible to discover rules and relationships by
scanning data. Such tools may therefore be considered variants of data mining
tools.
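As a rough illustration of the anomalies listed above, the following minimal sketch checks records for an inconsistent value assignment, a missing entry, and an integrity constraint violation. The record layout, the `GENDER_MAP` rule table, and the `KNOWN_CUSTOMERS` set are assumptions made up for this example; they are not taken from any particular cleaning tool.

```python
# Hypothetical source records; field names and rules are illustrative assumptions.
RECORDS = [
    {"id": 1, "gender": "M", "country": "NP", "customer_id": 10},
    {"id": 2, "gender": "male", "country": "NP", "customer_id": 11},
    {"id": 3, "gender": "F", "country": None, "customer_id": 99},
]
KNOWN_CUSTOMERS = {10, 11}  # stand-in for the table that customer_id refers to
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def scrub(record):
    """Return a cleaned copy of the record and a list of detected anomalies."""
    cleaned = dict(record)
    problems = []

    # Inconsistent value assignment: map known variants to one canonical code.
    raw = str(record["gender"]).strip().lower()
    if raw in GENDER_MAP:
        cleaned["gender"] = GENDER_MAP[raw]
    else:
        problems.append("unknown gender code")

    # Missing entry.
    if record["country"] is None:
        problems.append("missing country")

    # Integrity constraint: customer_id must exist in the referenced set.
    if record["customer_id"] not in KNOWN_CUSTOMERS:
        problems.append("unknown customer_id")

    return cleaned, problems


if __name__ == "__main__":
    for rec in RECORDS:
        cleaned, problems = scrub(rec)
        print(cleaned["id"], cleaned["gender"], problems)
```

The canonical-code mapping loosely corresponds to a simple transformation rule, while the other two checks only flag anomalies and leave the correction to a later step.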
Load
• After extraction, cleaning, and transformation, the data is loaded into the data
warehouse. A load utility has to allow the system administrator to monitor
status, to cancel, suspend, and resume a load, and to restart after failure with no
loss of data integrity.
• Sequential loads can take a very long time to complete, especially when they
involve terabytes of data. Therefore, pipelined and partitioned parallelism are
typically used.
• Incremental loading is also more widely used than full loading in most
commercial utilities, since it reduces the volume of data that has to be
incorporated into the data warehouse.
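The idea behind incremental loading can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `incremental_load`, the monotonically increasing `id` field, and the in-memory lists standing in for the source and the warehouse table are all hypothetical.

```python
def incremental_load(source_rows, warehouse, last_loaded_id):
    """Append only rows newer than the checkpoint; return the new checkpoint.

    source_rows: list of dicts with a monotonically increasing "id" (assumed)
    warehouse:   list standing in for the target table (assumed)
    """
    new_rows = [row for row in source_rows if row["id"] > last_loaded_id]
    for row in sorted(new_rows, key=lambda r: r["id"]):
        warehouse.append(row)
        # Advancing the checkpoint per row means a restarted load can resume
        # after the last row that was actually written.
        last_loaded_id = row["id"]
    return last_loaded_id


if __name__ == "__main__":
    source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
    warehouse = []
    checkpoint = incremental_load(source, warehouse, 0)

    source.append({"id": 3, "amount": 30})
    # Only the row with id 3 is transferred on the second run.
    checkpoint = incremental_load(source, warehouse, checkpoint)
    print(checkpoint, len(warehouse))
```

A full load would copy every source row on each run; here the second run moves a single row, which is where the reduction in load volume comes from.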
Refresh
• Refreshing a warehouse consists of propagating updates on the source data so
that the base data and the derived data stored in the warehouse are updated
correspondingly.
• There are two sets of issues to consider: when to refresh, and how to refresh.
Usually, the warehouse is refreshed periodically (e.g., daily or weekly).
• It is necessary to propagate every update only if some OLAP queries need
current data. The refresh policy is set by the warehouse administrator, depending
on user needs, and may be different for different sources.
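The difference between base data and derived data during a refresh can be sketched as below. The names `refresh`, `base`, and `derived`, and the sales-by-region layout, are assumptions for illustration only; a real warehouse would hold these in tables and summary tables.

```python
from collections import defaultdict


def refresh(base, derived, updates):
    """Apply a batch of source updates to base data and derived totals.

    base:    dict mapping sale_id -> (region, amount)
    derived: defaultdict mapping region -> total amount (a derived summary)
    updates: list of (sale_id, region, amount) captured at the source
    """
    for sale_id, region, amount in updates:
        if sale_id in base:
            old_region, old_amount = base[sale_id]
            derived[old_region] -= old_amount  # undo the old contribution
        base[sale_id] = (region, amount)       # update the base data
        derived[region] += amount              # update the derived data


if __name__ == "__main__":
    base, derived = {}, defaultdict(int)
    # First periodic refresh: two new sales.
    refresh(base, derived, [(1, "east", 100), (2, "west", 50)])
    # Second periodic refresh: sale 1 was corrected at the source.
    refresh(base, derived, [(1, "east", 120)])
    print(dict(derived))
```

Each call stands for one periodic refresh that applies a batch of accumulated updates; propagating every update immediately would amount to calling it with one update at a time.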
Data Warehouse Tuning
Data warehouse tuning is the process of applying different strategies to the
operations of a data warehouse so that its performance measures improve.
This requires thorough knowledge of the data warehouse.
Different aspects of a data warehouse can be tuned, such as performance, data
load, and queries.
Tuning a data warehouse is a difficult procedure for the following reasons:
• A data warehouse is dynamic; it never remains constant.
• It is very difficult to predict what queries users will pose in the future.
• Business requirements change with time.
• Users and their profiles keep changing.
• A user can switch from one group to another.
• The data load on the warehouse also changes with time.
Computation of Data Cubes
• Data warehouses contain huge volumes of data.
• OLAP servers demand that decision support queries be answered in the order of
seconds.
• It is crucial for data warehouse systems to support highly efficient cube
computation techniques, access methods, and query processing techniques.
Efficient Computation of Data Cubes
• At the core of multidimensional data analysis is the efficient computation of
aggregations across many sets of dimensions.
• In SQL terms, these aggregations are referred to as group-bys.
• Each group-by can be represented by a cuboid.
• The set of group-bys forms a lattice of cuboids that defines a data cube.
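The lattice of cuboids can be made concrete with a small sketch. For n dimensions there are 2^n group-bys, one per subset of the dimensions. The function `compute_cube` and the sample rows below are assumptions for illustration. The sketch recomputes every cuboid from the base rows, so it shows the structure of the cube only and is not one of the efficient computation techniques the slide refers to.

```python
from collections import defaultdict
from itertools import combinations


def compute_cube(rows, dimensions, measure):
    """Return {group-by tuple: {dimension values: SUM(measure)}} for every cuboid."""
    cube = {}
    for k in range(len(dimensions) + 1):
        # Every subset of the dimensions is one group-by, i.e. one cuboid.
        for group_by in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cube[group_by] = dict(totals)
    return cube


# Hypothetical fact rows with three dimensions and one measure.
ROWS = [
    {"city": "Kathmandu", "item": "phone", "year": 2023, "sales": 5},
    {"city": "Kathmandu", "item": "laptop", "year": 2023, "sales": 7},
    {"city": "Pokhara", "item": "phone", "year": 2024, "sales": 3},
]

if __name__ == "__main__":
    cube = compute_cube(ROWS, ("city", "item", "year"), "sales")
    print(len(cube))            # number of cuboids in the lattice
    print(cube[()])             # apex cuboid: the grand total
    print(cube[("city",)])      # one-dimensional cuboid grouped by city
```

The empty group-by `()` is the apex cuboid (the grand total), and the group-by on all three dimensions is the base cuboid.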
Data Warehouse - Testing
Testing is essential to ensure that data warehouse systems work
correctly and efficiently. There are three basic levels of testing performed on a
data warehouse:
• Unit testing
• Integration testing
• System testing
Unit Testing
• In unit testing, each component is tested separately.
• Each module (i.e., procedure, program, SQL script, or Unix shell script) is tested.
• This test is performed by the developer.
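A unit test for a single module might look like the following sketch, which uses Python's standard `unittest` module. The transformation `normalize_amount` is a hypothetical ETL helper invented for this example; it stands in for whichever procedure or script the developer is testing.

```python
import unittest


def normalize_amount(text):
    """Hypothetical ETL transformation: parse a string like '1,250.50' into a float."""
    return float(text.replace(",", "").strip())


class NormalizeAmountTest(unittest.TestCase):
    def test_removes_thousands_separator(self):
        self.assertEqual(normalize_amount("1,250.50"), 1250.5)

    def test_strips_whitespace(self):
        self.assertEqual(normalize_amount(" 42 "), 42.0)

    def test_rejects_non_numeric(self):
        # float() raises ValueError for text that is not a number.
        with self.assertRaises(ValueError):
            normalize_amount("n/a")
```

Saved to a file, the tests can be run with `python -m unittest <file>`. Each test exercises the one component in isolation, without the rest of the load pipeline.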
Integration Testing
• In integration testing, the various modules of the application are brought
together and then tested against a number of inputs.
• It is performed to check whether the various components work correctly
together after integration.
System Testing
• In system testing, the whole data warehouse application is tested together.
• The purpose of system testing is to check whether the entire system works
correctly as a whole.
• System testing is performed by the testing team.
• Because the whole data warehouse is very large, usually only minimal system
testing is possible before the test plan can be enacted.

Data warehouse testing
