Data Warehouses
Dr. S. Natarajan

Introduction
A Brief History of Information Technology
  • The “dark ages”: paper forms in file cabinets
  • Computerized systems emerge
    - Initially for big projects like Social Security
    - Same functionality as old paper-based systems
  • The “golden age”: databases are everywhere
    - Most activities tracked electronically
    - Stored data provides detailed history of activity
  • The next step: use data for decision-making
    - The focus of this course!
    - Made possible by omnipresence of IT
    - Identify inefficiencies in current processes
    - Quantify likely impact of decisions

Databases for Decision Support
  • 1st phase: Automating existing processes makes them more efficient.
    - Automation -> lots of well-organized, easily accessed data
  • 2nd phase: Data analysis allows for better decision-making.
    - Analyze data -> better understanding
    - Better understanding -> better decisions
  • “Data Entry” vs. “Thinking”
    - Data analysts are decision-makers: managers, executives, etc.

OLTP vs. OLAP
  • OLTP: On-Line Transaction Processing
    - Many short transactions (queries + updates)
    - Examples:
        - Update account balance
        - Enroll in course
        - Add book to shopping cart
    - Queries touch small amounts of data (one record or a few records)
    - Updates are frequent
    - Concurrency is the biggest performance concern
  • OLAP: On-Line Analytical Processing
    - Long transactions, complex queries
    - Examples:
        - Report total sales for each department in each month
        - Identify top-selling books
        - Count classes with fewer than 10 students
    - Queries touch large amounts of data
    - Updates are infrequent
    - Individual queries can require lots of resources

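The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `sales` table, its columns, and the sample rows are assumptions made up for the example, not part of the slides.

```python
import sqlite3

# Hypothetical in-memory table; schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, department TEXT, month TEXT, amount REAL)"
)
rows = [
    ("Books", "2024-01", 20.0),
    ("Books", "2024-01", 15.0),
    ("Books", "2024-02", 30.0),
    ("Music", "2024-01", 12.5),
]
conn.executemany(
    "INSERT INTO sales (department, month, amount) VALUES (?, ?, ?)", rows
)

# OLTP-style work: a short transaction that touches a single record.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 5.0 WHERE id = ?", (1,))

# OLAP-style work: one query that scans the whole table and aggregates it
# ("total sales for each department in each month").
report = conn.execute(
    "SELECT department, month, SUM(amount) "
    "FROM sales GROUP BY department, month "
    "ORDER BY department, month"
).fetchall()

for department, month, total in report:
    print(department, month, total)
```

The update reads and writes one row by primary key, while the report has to visit every row. On a real system with millions of rows, that difference is what drives the separate performance concerns listed for each workload.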
Why OLAP & OLTP don’t mix (1): different performance requirements
  • Transaction processing (OLTP):
    - Fast response time is important (< 1 second)
    - Data must be up-to-date and consistent at all times
  • Data analysis (OLAP):
    - Queries can consume lots of resources
    - Can saturate CPUs and disk bandwidth
    - Operating on a static “snapshot” of data is usually OK
  • OLAP can “crowd out” OLTP transactions
    - Transactions are slow -> unhappy users
  • Example:
    - Analysis query asks for sum of all sales
    - Acquires lock on sales table for consistency
    - New sales transaction is blocked

Why OLAP & OLTP don’t mix (2): different data modeling requirements
  • Transaction processing (OLTP):
    - Normalized schema for consistency
    - Complex data models, many tables
    - Limited number of standardized queries and updates
  • Data analysis (OLAP):
    - Simplicity of data model is important
        - Allow semi-technical users to formulate ad hoc queries
    - De-normalized schemas are common
        - Fewer joins -> improved query performance
        - Fewer tables -> schema is easier to understand

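To make the "fewer joins" point concrete, here is a small sketch using Python's sqlite3 module. All table and column names (`customers`, `products`, `orders`, `sales_flat`) are hypothetical, chosen only to show the same question asked against a normalized schema and against a de-normalized copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (OLTP-style) schema: each fact is stored once.
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    product_id INTEGER,
    amount REAL
);
INSERT INTO customers VALUES (1, 'North'), (2, 'South');
INSERT INTO products  VALUES (10, 'Books'), (11, 'Music');
INSERT INTO orders    VALUES (100, 1, 10, 20.0), (101, 1, 11, 5.0),
                             (102, 2, 10, 7.5);

-- De-normalized (OLAP-style) table: region and category are repeated
-- on every row, so analysis queries need no joins.
CREATE TABLE sales_flat AS
    SELECT o.order_id, c.region, p.category, o.amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id;
""")

# Sales by region and category against the normalized schema: two joins.
normalized = conn.execute("""
    SELECT c.region, p.category, SUM(o.amount)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id
    GROUP BY c.region, p.category
    ORDER BY c.region, p.category
""").fetchall()

# The same question against the flat table: one table, no joins.
flat = conn.execute("""
    SELECT region, category, SUM(amount)
    FROM sales_flat
    GROUP BY region, category
    ORDER BY region, category
""").fetchall()

print(normalized == flat)
```

Both queries return the same rows. The flat version is the one a semi-technical user is more likely to write correctly, at the cost of storing `region` and `category` redundantly.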
Why OLAP & OLTP don’t mix (3): analysis requires data from many sources
  • An OLTP system targets one specific process
    - For example: ordering from an online store
  • OLAP integrates data from different processes
    - Combine sales, inventory, and purchasing data
    - Analyze experiments conducted by different labs
  • OLAP often makes use of historical data
    - Identify long-term patterns
    - Notice changes in behavior over time
  • Terminology and schemas vary across data sources
    - Integrating data from disparate sources is a major challenge

  • A data warehouse is a collection of integrated databases designed to support a decision support system (DSS).
  • An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data.
  • A data mart is a lower-cost, scaled-down version of a data warehouse, usually designed to support a small group of users (rather than the entire firm).
  • Metadata is information that is kept about the warehouse.

Organizational Data Flow and Data Storage Components
Loading the Data Warehouse
  • Source Systems (OLTP): data is periodically extracted
  • Data Staging Area: data is cleansed and transformed
  • Data Warehouse: users query the data warehouse

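The three stages can be sketched as a tiny extract / transform / load pipeline. This is a minimal illustration under stated assumptions: the source rows, the cleansing rules, and the `sales_fact` table name are all invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from a source (OLTP) system:
# everything is text, with stray whitespace, mixed case and a missing value.
source_rows = [
    {"store": "1023", "product": "k596 ", "amount": "111.50"},
    {"store": "1023", "product": "K596", "amount": ""},
    {"store": "2001", "product": "A100", "amount": "40"},
]

def extract():
    """Periodic extraction: take a copy of the source data."""
    return list(source_rows)

def transform(rows):
    """Staging area: cleanse and convert rows to a consistent format."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # assumption: rows without an amount are dropped
        cleaned.append(
            (int(row["store"]), row["product"].strip().upper(), float(row["amount"]))
        )
    return cleaned

def load(conn, rows):
    """Load the cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(store INTEGER, product TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))

# Users query the warehouse, not the source systems.
total = warehouse.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
print(total)
```

Dropping incomplete rows is only one possible cleansing policy. A real project might instead flag them or fill in a documented default.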
Characteristics of a Data Warehouse
  • Subject oriented: organized based on use
  • Integrated: inconsistencies removed
  • Nonvolatile: stored in read-only format
  • Time variant: data are normally time series
  • Summarized: in decision-usable format
  • Large volume: data sets are quite large
  • Non-normalized: often redundant
  • Metadata: data about data are stored
  • Data sources: data come from nonintegrated sources

A Data Warehouse is  Subject  Oriented
Data in a Data Warehouse are Integrated
The Data Warehouse Architecture
The architecture consists of various interconnected elements:
  • Operational and external database layer: the source data for the DW
  • Information access layer: the tools the end user uses to extract and analyze the data
  • Data access layer: the interface between the operational and information access layers
  • Metadata layer: the data directory or repository of metadata information

The Data Warehouse Architecture (cont.)
Additional layers are:
  • Process management layer: the scheduler or job controller
  • Application messaging layer: the “middleware” that transports information around the firm
  • Physical data warehouse layer: where the actual data used in the DSS are located
  • Data staging layer: all of the processes necessary to select, edit, summarize and load warehouse data from the operational and external databases

Components of the Data   Warehouse   Architecture
Data Warehousing Typology
  • The virtual data warehouse: the end users have direct access to the data stores, using tools enabled at the data access layer
  • The central data warehouse: a single physical database contains all of the data for a specific functional area
  • The distributed data warehouse: the components are distributed across several physical databases

Data Have Data: The Metadata
  • The name suggests some high-level technological concept, but it really is fairly simple. Metadata are “data about data”.
  • With the emergence of the data warehouse as a decision support structure, the metadata are considered as much a resource as the business data they describe.
  • Metadata are abstractions: they are high-level data that provide concise descriptions of lower-level data.

The Metadata in Action
  • The metadata are essential ingredients in the transformation of raw data into knowledge. They are the “keys” that allow us to handle the raw data.
  • For example, a line in a sales database may contain:
        1023  K596  111.50
  • This is mostly meaningless until we consult the metadata (in the data directory) that tells us it was store number 1023, product K596 and sales of Rs 111.50.

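The sales-line example can be sketched in a few lines of Python. The structure of the `metadata` list and the field names `store_number`, `product_code` and `sales_amount` are assumptions for illustration. The slides only say that the data directory tells us what each value means.

```python
# Hypothetical data-directory entry describing each column of a sales line.
metadata = [
    {"name": "store_number", "type": int, "description": "store number"},
    {"name": "product_code", "type": str, "description": "product identifier"},
    {"name": "sales_amount", "type": float, "description": "sales in Rs"},
]

def decode(line, metadata):
    """Turn a raw whitespace-separated line into a labelled record."""
    fields = line.split()
    if len(fields) != len(metadata):
        raise ValueError("line does not match the metadata description")
    return {
        column["name"]: column["type"](value)
        for column, value in zip(metadata, fields)
    }

record = decode("1023  K596  111.50", metadata)
print(record)
```

Without the `metadata` list the three values are just tokens. With it, the same line becomes a record that a query or report can use by name.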
Implementing the Data Warehouse
Kozar assembled a list of “seven deadly sins” of data warehouse implementation:
  • “If you build it, they will come”: the DW needs to be designed to meet people’s needs
  • Omission of an architectural framework: you need to consider the number of users, volume of data, update cycle, etc.
  • Underestimating the importance of documenting assumptions: the assumptions and potential conflicts must be included in the framework

“Seven Deadly Sins”, continued
  • Failure to use the right tool: a DW project needs different tools than those used to develop an application
  • Life cycle abuse: in a DW, the life cycle really never ends
  • Ignorance about data conflicts: resolving these takes a lot more effort than most people realize
  • Failure to learn from mistakes: since one DW project tends to beget another, learning from the early mistakes will yield higher quality later

The Future of Data Warehousing
As the DW becomes a standard part of an organization, there will be efforts to find new ways to use the data. This will likely bring with it several new challenges:
  • Regulatory constraints may limit the ability to combine sources of disparate data.
  • These disparate sources are likely to contain unstructured data, which is hard to store.
  • The Internet makes it possible to access data from virtually “anywhere”. Of course, this just increases the disparity.

Data Integration is Hard
  • Data warehouses combine data from multiple sources
  • Data must be translated into a consistent format
  • Data integration represents ~80% of effort for a typical data warehouse project!
  • Some reasons why it’s hard:
    - Metadata is poor or non-existent
    - Data quality is often bad
        - Missing or default values
        - Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
    - Inconsistent semantics
        - What is an airline passenger?

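Two of the data-quality problems listed here, missing values and multiple spellings, can be illustrated with a small cleaning function. The alias table, the set of "missing" markers, and the choice of "UC Berkeley" as the canonical spelling are all assumptions for the sketch. In practice such mappings have to be built and maintained by hand for each source.

```python
# Hypothetical lookup tables; real ones are assembled per data source.
ALIASES = {
    "cal": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
    "university of california": "UC Berkeley",
}
MISSING_MARKERS = {"", "n/a", "unknown"}

def normalize_school(raw):
    """Map a raw school name to one canonical spelling, or None if missing."""
    key = raw.strip().lower()
    if key in MISSING_MARKERS:
        return None
    # Unknown names pass through unchanged (apart from trimmed whitespace).
    return ALIASES.get(key, raw.strip())

raw_values = ["Cal", "UC Berkeley ", "University of California", "N/A", "Stanford"]
cleaned = [normalize_school(value) for value in raw_values]
print(cleaned)
```

The code is short, but deciding what belongs in `ALIASES` and `MISSING_MARKERS` requires knowing each source system, which is one reason integration takes so much of the project effort.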
Federated Databases
  • An alternative to data warehouses
  • Data warehouse (“eager” data integration)
    - Create a copy of all the data
    - Execute queries against the copy
  • Federated database (“lazy” data integration)
    - Pull data from source systems as needed to answer queries
  [Diagram. Data Warehouse: Source Systems -> Extraction -> Warehouse; Query -> Warehouse -> Answer. Federated Database: Query -> Mediator -> Rewritten Queries -> Source Systems; Mediator -> Answer.]

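The difference between eager and lazy integration can be sketched with two small Python classes. This is a toy model under loud assumptions: the source systems are plain lists of `(product, amount)` tuples, and the `Warehouse` and `Mediator` classes are hypothetical names that mirror the boxes in the diagram, not any real product's API.

```python
class Warehouse:
    """Eager integration: copy the data first, then query the copy."""

    def __init__(self, sources):
        self.sources = sources
        self.copy = []

    def extract(self):
        # Periodic extraction: snapshot every source system.
        self.copy = [row for source in self.sources for row in source]

    def total_sales(self):
        return sum(amount for _, amount in self.copy)


class Mediator:
    """Lazy integration: go to the source systems at query time."""

    def __init__(self, sources):
        self.sources = sources

    def total_sales(self):
        # Each query is answered from the live source data.
        return sum(amount for source in self.sources for _, amount in source)


# Two hypothetical source systems.
store_a = [("K596", 111.5)]
store_b = [("A100", 40.0)]

warehouse = Warehouse([store_a, store_b])
warehouse.extract()
mediator = Mediator([store_a, store_b])

# A new sale arrives after the last extraction.
store_a.append(("K596", 10.0))

stale = warehouse.total_sales()   # answered from the snapshot
fresh = mediator.total_sales()    # answered from the live sources
print(stale, fresh)
```

The warehouse answer is slightly out of date but never touches the source systems at query time. The mediator answer is current but puts every analysis query on the transactional systems.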
Warehouses vs. Federation
  • Advantages of federated databases:
    - No redundant copying of data
    - Queries see “real-time” view of evolving data
    - More flexible security policy
  • Disadvantages of federated databases:
    - Analysis queries place extra load on transactional systems
    - Query optimization is hard to do well
    - Historical data may not be available
    - Complex “wrappers” needed to mediate between analysis server and source systems
  • Data warehouses are much more common in practice
    - Better performance
    - Lower complexity
    - Slightly out-of-date data is acceptable

Visit  www.jsbi.blogspot.com  for more slides/information!! Mail :  [email_address]

Introduction to Data Warehousing
