An Introduction
to Data
Warehousing
Presented by Animesh Srivastava
In the beginning..
 In the early 80s the concept of RDBMS ushered in an era of improved access to
the valuable information contained deep within data
 Our need for information grew exponentially, and we needed an
efficient decision support system (OLAP) to run the business
 As data volumes grew, OLTP systems became inefficient for
complex query processing, reporting and analytical needs
In the beginning..
 In the early 70s Bill Inmon, often called “the father of the data warehouse”,
coined the term “Data Warehouse”
 According to Inmon, “Data Warehouse is a collection of integrated, subject-oriented
databases designed to support a DSS (decision support system),
where each unit of data is non-volatile and relevant to some moment in
time”
These concerns have existed for more than three decades:
 “We collect tons of data, but we can’t
access it.”
 “We need to slice and dice the data
every which way.”
 “Business people need to get at the
data easily.”
 “Just show me what is important.”
 “We spend entire meetings arguing
about who has the right numbers rather
than making decisions.”
 “We want people to use information to
support more fact-based decision
making.”
Why do we need a DWH?
 Consolidation of information
resources across multiple
platforms and even geographies in
a single place by extracting (E),
transforming (T) and finally
loading (L) the data
 Improved query performance
 A foundation for data mining, data
visualization and advanced reporting
using BI and OLAP tools
What is DWH
used for?
 Information and Knowledge:
 Intelligent Reporting and analytics
 Studying the trends and nature of
business over time
 Predicting the future with the help of
data
 Help the business take key decisions and
form strategies
 Manage the history of transactions
that have taken place within an organization
Examples:
What is DWH used
for?
Examples
 E-commerce
Cross-selling and product suggestions,
such as Amazon's
‘People also bought/viewed’ feature.
Companies often use customer data
to predict or influence
customer requirements, considering
parameters such as gender,
geography and age.
What is DWH used
for?
Examples
 Supermarkets
One notable example of this was
the US retailer Target. As part of its data
mining program, the company developed
rules to predict whether shoppers were likely
to be pregnant. By looking at the contents of
customers' shopping baskets, it could
spot customers who were likely
to be expecting and begin targeting
promotions for nappies, cotton wool and so
on. The prediction was reportedly accurate enough that
Target made the news by sending
promotional coupons to a household before the family
knew about the pregnancy.
What is DWH used for?
Examples
 Crime Agencies (where is crime
most likely to happen and
when?)
Crime prevention agencies use
analytics and data mining to spot
trends across large volumes of data,
helping with decisions such as
where to deploy police manpower.
What is DWH used
for?
Examples
 Service Providers
Mobile phone and utilities companies
use data mining and business
intelligence to predict ‘churn’, the
term they use for when a customer
leaves their company to get their
phone/gas/broadband from another
provider. They collate billing
information, customer services
interactions, website visits and other
metrics to give each customer a
probability score, then target offers
and incentives to customers whom they
perceive to be at a higher risk of
churning.
Responsibilities of DW/BI Managers:
 Understand the business users and determine the decisions that the business
users want to make with the help of the DW/BI system.
 Deliver high-quality, relevant, and accessible information and analytics to the
business users:
 Produce robust, presentable and meaningful data
 Continuously monitor the accuracy of the data and analyses
 Adapt to changing user profiles, requirements, and business priorities, along with
the availability of new data sources
 Sustain the DW/BI environment by updating the DW/BI system on a regular
basis and justifying staffing and ongoing expenditures
Pre-requisites before diving in…
 Types of Databases:
 OLTP
 OLAP
 What is Normalization?
 Types (1NF, 2NF, 3NF)
 Keys (Primary, Foreign, Composite, Surrogate)
 Data Modelling
 Conceptual
 Logical
 Physical
 E-R models
 Dimensional Modelling
 Star Schema
 Fact Tables
 Dimension tables
 Slowly changing dimensions (SCD – I, SCD – II, SCD – III)
OLTP and OLAP Databases
 Types of Databases:
 OLTP - Online transaction processing
 OLAP - Online analytical processing
 One of the most important assets of any organization is its information. This
asset is almost always used for two purposes: operational record keeping and
analytical decision making.
 Simply speaking, the operational systems (OLTP) are where you put the data
in, and the DW/BI system (OLAP) is where you get the data out.
OLTP vs OLAP
OLTP
 OLTPs are the original source of
the data.
 To control and run fundamental
business tasks
 Reveals a snapshot of ongoing
business processes
 Highly normalized with many
tables
 Low storage requirements
 Volatile
OLAP
 OLAP data comes from the various
OLTP Databases
 To help with planning, problem
solving, and decision support
 Multi-dimensional views of various
kinds of business activities
 De-normalized, star schema
 High storage requirements
 Non-volatile
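A minimal sketch of the two workloads, using Python's built-in sqlite3 module. The `orders` table and its values are hypothetical, and a single in-memory database stands in for both systems; in practice the OLAP side would be a separate warehouse fed from the OLTP sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: many small writes, each touching a single row.
conn.execute("INSERT INTO orders VALUES (1, 'North', 120.0)")
conn.execute("INSERT INTO orders VALUES (2, 'South', 80.0)")
conn.execute("INSERT INTO orders VALUES (3, 'North', 50.0)")
conn.execute("UPDATE orders SET amount = 100.0 WHERE order_id = 2")

# OLAP-style work: a read-only query that scans and summarizes many rows.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(sales_by_region)
```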
What is Normalization?
 Normalization is a database design technique that organizes tables in a
manner that reduces data redundancy and undesirable dependencies.
 It divides larger tables into smaller tables and links them using relationships.
 Three normal forms are used most often in practice. Further normal forms
(such as BCNF, 4NF, 5NF and 6NF) exist but are applied less often.
 1NF
 2NF
 3NF
Types of
Normalization:
1NF
 Rules:
 Each table cell should
contain a single
value.
 Each record needs to
be unique
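The two 1NF rules can be illustrated with a small sketch. The customer rows and the comma-separated `phones` cell are made-up sample data, not taken from the slides.

```python
# Hypothetical un-normalized rows: the "phones" cell holds several values.
raw_rows = [
    (1, "Asha", "555-0100, 555-0101"),
    (2, "Ravi", "555-0200"),
]

# Rule 1: one value per cell, so each phone number gets its own row.
first_normal_form = [
    (customer_id, name, phone.strip())
    for customer_id, name, phones in raw_rows
    for phone in phones.split(",")
]

# Rule 2: each record must be unique (a set would collapse any duplicates).
all_unique = len(first_normal_form) == len(set(first_normal_form))
print(first_normal_form, all_unique)
```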
Types of
Normalization:
2NF
 Rules:
 The entity should be
in 1NF
 All non-key attributes within the
entity should depend
on the whole unique
identifier of the entity, not on just part of it
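A rough sketch of a 2NF split, assuming a hypothetical order-items table whose identifier is the pair (order_id, product_id). Here `product_name` depends on `product_id` alone, which is the kind of partial dependency 2NF removes.

```python
# Hypothetical table keyed by (order_id, product_id).
# product_name depends on product_id alone: a partial dependency.
order_items = [
    {"order_id": 1, "product_id": 10, "product_name": "Pen", "quantity": 3},
    {"order_id": 1, "product_id": 11, "product_name": "Ink", "quantity": 1},
    {"order_id": 2, "product_id": 10, "product_name": "Pen", "quantity": 5},
]

# 2NF: move product_name into a table keyed by product_id only.
products = {row["product_id"]: row["product_name"] for row in order_items}

# What remains depends on the whole key (order_id, product_id).
order_lines = {
    (row["order_id"], row["product_id"]): row["quantity"] for row in order_items
}
print(products, order_lines)
```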
Types of Normalization:
3NF
[Figures: Product table; Brand table]
Rules:
• The entity should be in 2NF
• No non-key column should depend on another non-key column; every non-key
column should depend only on the key of the table
Types of Normalization:
3NF
[Figures: Product brand table; ER model]
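The product/brand split named on these slides might look like the sketch below. The column names and sample rows are assumptions, since the original figures are not available; the point is only that brand attributes live in their own table and a join rebuilds the combined view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- brand_country depends on the brand, not on the product,
    -- so it lives in the brand table rather than in product.
    CREATE TABLE brand (
        brand_id      INTEGER PRIMARY KEY,
        brand_name    TEXT NOT NULL,
        brand_country TEXT NOT NULL
    );
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        brand_id     INTEGER NOT NULL REFERENCES brand(brand_id)
    );
    INSERT INTO brand VALUES (1, 'Acme', 'US');
    INSERT INTO product VALUES (100, 'Anvil', 1), (101, 'Rocket', 1);
""")

# A join reassembles the wide "product brand" view on demand.
product_brand = conn.execute("""
    SELECT p.product_name, b.brand_name, b.brand_country
    FROM product p JOIN brand b ON b.brand_id = p.brand_id
    ORDER BY p.product_id
""").fetchall()
print(product_brand)
```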
Different types of keys -
 Primary Key - Identifies each record in a table uniquely. A primary key column does not allow
NULL values and holds unique values throughout.
 Foreign Key - In a relationship between two tables, the primary key of one table is referenced by a
foreign key in the other table. A foreign key can hold duplicate values, and can also hold
NULL values if the column is defined to accept them.
 Candidate Key - A key that could be selected as the primary key of the table. A table can have multiple
candidate keys, one of which is chosen as the primary key.
 Unique Key - Contains only unique values, but unlike a primary key it also permits NULL values.
 Alternate Key - A candidate key that was not selected as the primary key of the table.
 Composite Key (also known as compound key or concatenated key) - A group of two or
more columns that together identify each row of a table uniquely. An individual column of a composite
key might not be able to identify the record uniquely on its own. It can be a primary key or a candidate
key.
 Super Key - A set of columns that uniquely identifies each row in a table. A super key may
hold additional columns that are not strictly required to identify each row.
Primary and candidate keys are minimal super keys, i.e. a subset of the super keys.
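A small sketch of several of these key types as SQLite constraints. The `customer` and `enrollment` tables are hypothetical, and the `violates` helper exists only for this illustration. Note that SQLite checks foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary key
        email       TEXT UNIQUE            -- unique key
    );
    CREATE TABLE enrollment (
        customer_id INTEGER REFERENCES customer(customer_id),  -- foreign key
        course_code TEXT,
        PRIMARY KEY (customer_id, course_code)                 -- composite key
    );
    INSERT INTO customer VALUES (1, 'a@example.com');
    INSERT INTO enrollment VALUES (1, 'DW101');
""")

def violates(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

duplicate_email = violates("INSERT INTO customer VALUES (2, 'a@example.com')")
orphan_row = violates("INSERT INTO enrollment VALUES (99, 'DW101')")
duplicate_pair = violates("INSERT INTO enrollment VALUES (1, 'DW101')")
new_pair = violates("INSERT INTO enrollment VALUES (1, 'DW102')")
print(duplicate_email, orphan_row, duplicate_pair, new_pair)
```

Only the last insert succeeds: neither column of the composite key is unique on its own, but the pair is.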
Data Modelling
 A data model is a conceptual representation of database tables/objects that serves as
the blueprint of the whole schema.
 There are 3 types of data model, listed here from the most abstract to
the most detailed:
 Conceptual
 Logical
 Physical
Data Modelling
(Conceptual Model)
 The main aim of this model is to
establish the entities, their attributes,
and their relationships. It contains very little
detail about the actual database
structure.
 The 3 basic tenets of a data model are
 Entity: A real-world thing (e.g.
Customer, Product)
 Attribute: Characteristics or
properties of an entity
 Relationship: Dependency or
association between two entities.
Data Modelling
(Logical Model)
 It defines the structure of the data
elements and sets the relationships
between them.
 The advantage of the logical data
model is that it provides the
foundation for the physical
model.
 At this Data Modeling level, no
primary or secondary key is
defined.
Data Modelling
(Physical Model)
 It describes the database specific
implementation of the data model.
 It offers an abstraction of the
database and helps generate
schema.
 This type of Data model also helps
to visualize database structure.
 It helps to model database columns,
keys, constraints, indexes,
triggers, and other RDBMS
features.
Entity Relationship Models/Diagrams
 Entity-relationship diagrams (ER diagrams or ERDs) are drawings that
communicate the relationships between tables.
 It is a type of flowchart that illustrates how “entities” such as people, objects
or concepts relate to each other within a system.
 ER Diagrams are most often used to design or debug relational databases.
 Also known as ERDs or ER Models, they use a defined set of symbols such as
rectangles, diamonds, ovals and connecting lines to depict the
interconnectedness of entities, relationships and their attributes.
Dimensional Modelling
 Dimensional modeling is widely accepted as the preferred technique for
presenting analytic data because it addresses two simultaneous requirements:
 Deliver data that’s understandable to the business users.
 Deliver fast query performance.
 Both 3NF and dimensional models can be represented in ERDs because both
consist of joined relational tables; the key difference between 3NF and
dimensional models is the degree of normalization.
 Normalized 3NF structures divide data into many discrete entities, each of
which becomes a relational table. A database of sales orders might start with
a record for each order line but turn into a complex spider web diagram as a
3NF model, perhaps consisting of hundreds of normalized tables.
Dimensional Modelling (contd.)
 Normalized 3NF structures are immensely useful in operational processing
(OLTP) because an update or insert transaction touches the database in only
one place.
 Normalized models, however, are too complicated for BI queries. Users can’t
understand, navigate, or remember normalized models that resemble a map
of a city.
 The complexity of users’ unpredictable queries overwhelms the database
optimizers, resulting in disastrous query performance.
 Fortunately, dimensional modeling addresses the problem of overly complex
schemas in the presentation area.
 A dimensional model contains the same information as a normalized model,
but packages the data in a format that delivers user understandability, query
performance, and resilience to change.
Dimensional Modelling
– Star Schema
 Dimensional models
implemented in relational
database management systems
are referred to as star schemas
because of their resemblance to
a star-like structure.
 The downside of dimensional
modelling is that you pay a load
performance price for these
capabilities, especially with large
data sets.
Dimensional Modelling
– Fact Tables
 The fact table in a dimensional
model stores the performance
measurements resulting from an
organization’s business process
events.
 You should strive to store the low-level
measurement data resulting
from a business process in a single
dimensional model. The level of detail of each row
is called the grain (e.g. "balls per innings").
 Imagine standing in the marketplace
watching products being sold and
writing down the unit quantity and
dollar sales amount for each product
in each sales transaction.
Dimensional Modelling – Fact Tables
(contd.)
 Types of facts: additive, semi-additive and non-additive
 Additive - Can be summed across any of the dimensions
associated with the fact table (e.g. sales amount).
 Semi-additive - Can be summed across some dimensions, but not all. Balance
amounts are common semi-additive facts because they are additive across all
dimensions except time.
 Non-additive - Cannot be summed across any dimension; examples are ratios,
percentages and percentiles.
 Despite their sparsity, fact tables usually make up 90 percent or more of the
total space consumed by a dimensional model.
 Fact tables tend to be deep in terms of the number of rows, but narrow in
terms of the number of columns.
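The difference between the fact types shows up as soon as you try to sum them. The sketch below uses made-up balance snapshots and sales figures; the variable names are illustrative only.

```python
# Hypothetical daily balance snapshots: (day, account, balance).
balances = [
    ("2013-01-01", "A", 100.0), ("2013-01-01", "B", 50.0),
    ("2013-01-02", "A", 120.0), ("2013-01-02", "B", 50.0),
]

# Semi-additive: summing across accounts for one day is meaningful...
total_on_day_2 = sum(b for day, _, b in balances if day == "2013-01-02")

# ...but summing one account across days is not; an average is one alternative.
a_balances = [b for _, account, b in balances if account == "A"]
average_balance_a = sum(a_balances) / len(a_balances)

# Non-additive: a ratio must be recomputed from its additive parts.
sales = [(200.0, 50.0), (100.0, 10.0)]  # (revenue, profit) per transaction
overall_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)
naive_margin_sum = sum(p / r for r, p in sales)  # adds 0.25 and 0.10: not a valid margin
print(total_on_day_2, average_balance_a, overall_margin, naive_margin_sum)
```

This is why fact tables usually store the additive components (revenue, profit) rather than the ratio itself.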
Dimensional Modelling
– Dimension Tables
 The dimension tables contain the textual
context associated with a business process
measurement event.
 They describe the “who, what, where, when,
how, and why” associated with the event.
 Dimension tables often have many columns or
attributes.
 Dimension tables tend to have fewer rows
than fact tables, but can be wide with many
large text columns.
 Each dimension is defined by a single primary
key, which serves as the basis for referential
integrity with any given fact table to which it
is joined.
Dimensional Modelling
– Dimension Tables
 Dimension attributes serve as the primary
source of query constraints, groupings, and
report labels.
 Dimension attributes are critical to making
the DW/BI system usable and understandable.
 The analytic power of the DW/BI environment
is directly proportional to the quality and
depth of the dimension attributes.
 Instead of third normal form, dimension
tables typically are highly denormalized with
flattened many-to-one relationships within a
single dimension table.
 We can almost always trade off dimension
table space for simplicity and accessibility.
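Flattening a many-to-one chain into one wide dimension row can be sketched as follows. The category, brand and product lookups are hypothetical 3NF-style tables, not data from the deck.

```python
# Hypothetical normalized (3NF-style) lookup tables.
categories = {1: "Stationery"}
brands = {1: {"brand_name": "Acme", "category_id": 1}}
products = {
    100: {"product_name": "Pen", "brand_id": 1},
    101: {"product_name": "Ink", "brand_id": 1},
}

# Flatten the chain product -> brand -> category into one wide row per
# product, repeating the brand and category text on every row.
product_dim = [
    {
        "product_key": key,
        "product_name": p["product_name"],
        "brand_name": brands[p["brand_id"]]["brand_name"],
        "category_name": categories[brands[p["brand_id"]]["category_id"]],
    }
    for key, p in sorted(products.items())
]
print(product_dim)
```

The repeated "Acme" and "Stationery" values cost some space, but a report can filter and group on them without any extra joins.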
Facts and Dimensions joined in a star
schema
 The first thing to notice about the dimensional schema is its simplicity and
symmetry.
 The charm of the design is that it is highly recognizable to business users.
 Furthermore, the reduced number of tables and use of meaningful business
descriptors make it easy to navigate and less likely that mistakes will occur.
 The simplicity of a dimensional model also has performance benefits.
 Database optimizers process these simple schemas with fewer joins more
efficiently.
 Dimension attributes supply the report filters and labeling, whereas the fact
tables supply the report’s numeric values.
Facts and Dimensions
joined in a star
schema
SELECT st.district_name,
       pd.brand,
       SUM(sf.sales_dollars) AS "Sales Dollars"
FROM sales_facts sf                                    -- fact table
JOIN store st    ON st.store_key = sf.store_key        -- dimension table
JOIN product pd  ON pd.product_key = sf.product_key    -- dimension table
JOIN date_dim dt ON dt.date_key = sf.date_key          -- dimension table (named date_dim because DATE is a reserved word)
WHERE dt.month_name = 'January'                        -- string literals take single quotes
  AND dt.year = 2013
GROUP BY st.district_name, pd.brand;
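To try the query above, a toy star schema can be built in an in-memory SQLite database. The sample rows, and the name `date_dim` for the date dimension (chosen here to sidestep the reserved word DATE), are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store    (store_key INTEGER PRIMARY KEY, district_name TEXT);
    CREATE TABLE product  (product_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month_name TEXT, year INTEGER);
    CREATE TABLE sales_facts (
        store_key INTEGER, product_key INTEGER, date_key INTEGER, sales_dollars REAL
    );
    INSERT INTO store    VALUES (1, 'East'), (2, 'West');
    INSERT INTO product  VALUES (1, 'Acme'), (2, 'Zenith');
    INSERT INTO date_dim VALUES (20130115, 'January', 2013), (20130215, 'February', 2013);
    INSERT INTO sales_facts VALUES
        (1, 1, 20130115, 10.0), (1, 1, 20130115, 5.0),
        (2, 2, 20130115, 7.0),  (1, 2, 20130215, 99.0);
""")

# Dimensions supply the filters and labels; the fact table supplies the numbers.
rows = conn.execute("""
    SELECT st.district_name, pd.brand, SUM(sf.sales_dollars) AS sales_dollars
    FROM sales_facts sf
    JOIN store st    ON st.store_key = sf.store_key
    JOIN product pd  ON pd.product_key = sf.product_key
    JOIN date_dim dt ON dt.date_key = sf.date_key
    WHERE dt.month_name = 'January' AND dt.year = 2013
    GROUP BY st.district_name, pd.brand
    ORDER BY st.district_name, pd.brand
""").fetchall()
print(rows)
```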
Slowly changing dimensions
 It is a dimension that stores and manages both current and historical data
over time in a data warehouse. It is considered and implemented as one of
the most critical ETL tasks in tracking the history of dimension records.
 Three types of SCD are most commonly used. Tools such as Oracle Warehouse Builder can
define, deploy, and load all three types:
 Type 1 SCDs - Overwriting
 Type 2 SCDs - Creating another dimension record
 Type 3 SCDs - Creating a current value field
Slowly Changing
Dimensions (contd.)
 Type 1 - Overwriting the old
value. In this method no history
of dimension changes is kept in
the database. The old dimension
value is simply overwritten by
the new one.
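A minimal Type 1 sketch, assuming a hypothetical `customer_dim` table: the change is a plain UPDATE, so the earlier value is gone afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1, 'Asha', 'Pune')")

# Type 1: overwrite in place; the previous city is lost.
conn.execute("UPDATE customer_dim SET city = 'Mumbai' WHERE customer_key = 1")

type1_rows = conn.execute("SELECT * FROM customer_dim").fetchall()
print(type1_rows)
```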
Slowly Changing
Dimensions (contd.)
 Type 2 - Creating a new
additional record. In this
methodology all history of
dimension changes is kept in the
database. Changes in the
attributes are captured by
adding a new row with a new
surrogate key to the dimension
table.
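A Type 2 change could be sketched like this. The `valid_from`, `valid_to` and `is_current` housekeeping columns are a common convention assumed here, not something the slide specifies, and `apply_type2_change` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  TEXT,     -- natural (business) key from the source system
        city         TEXT,
        valid_from   TEXT,
        valid_to     TEXT,     -- NULL while the row is current
        is_current   INTEGER
    )
""")
conn.execute(
    "INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
    "VALUES ('C001', 'Pune', '2012-01-01', NULL, 1)"
)

def apply_type2_change(customer_id, new_city, change_date):
    """Expire the current row, then add a new row with a new surrogate key."""
    conn.execute(
        "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

apply_type2_change("C001", "Mumbai", "2013-06-01")
history = conn.execute(
    "SELECT customer_key, city, valid_from, valid_to, is_current "
    "FROM customer_dim ORDER BY customer_key"
).fetchall()
print(history)
```

Facts loaded before the change keep pointing at surrogate key 1, so old sales still report under the old city.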
Slowly Changing
Dimensions (contd.)
 Type 3 - Adding a new column. In
this type usually only the current
and previous value of dimension
is kept in the database. The new
value is loaded into 'current/new'
column and the old one into
'old/previous' column. Generally
speaking the history is limited to
the number of columns created
for storing historical data. This is
the least commonly needed
technique.
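A Type 3 sketch with one 'current' and one 'previous' column. The table layout and the `apply_type3_change` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_key INTEGER PRIMARY KEY, current_city TEXT, previous_city TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1, 'Pune', NULL)")

def apply_type3_change(customer_key, new_city):
    """Shift the current value into the 'previous' column, then store the new one."""
    conn.execute(
        "UPDATE customer_dim SET previous_city = current_city, current_city = ? "
        "WHERE customer_key = ?",
        (new_city, customer_key),
    )

apply_type3_change(1, "Mumbai")
apply_type3_change(1, "Delhi")
type3_row = conn.execute("SELECT * FROM customer_dim").fetchone()
print(type3_row)
```

After the second change the original city is no longer stored, which shows how the history is capped by the number of columns.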
Data Marts
 A data mart is focused on a single functional
area of an organization and contains a subset
of data stored in a Data Warehouse.
 A data mart is a condensed version of Data
Warehouse and is designed for use by a
specific department, unit or set of users in an
organization. E.g., Marketing, Sales, HR or
finance.
 It is often controlled by a single department
in an organization.
 Data Mart usually draws data from only a few
sources compared to a Data warehouse.
 Data marts are small in size and are more
flexible compared to a Data warehouse.
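As a rough illustration of "a subset of the warehouse for one department", the sketch below exposes a marketing-only slice through a view. The `warehouse_sales` table is hypothetical, and a view is a simplification: real data marts are often separate physical tables or databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (department TEXT, region TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('Marketing', 'North', 10.0),
        ('Marketing', 'South', 20.0),
        ('HR',        'North', 5.0);

    -- The mart exposes only the rows and columns one department needs.
    CREATE VIEW marketing_mart AS
        SELECT region, amount FROM warehouse_sales WHERE department = 'Marketing';
""")

marketing_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
print(marketing_rows)
```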
A glimpse of Data Warehouse
Architecture
[Figure: Data Warehouse Architecture]
[Figure: Data Warehouse Architecture (Inmon)]
[Figure: Data Warehouse Architecture (Kimball)]
Data Warehouse Architectures (In Detail)
 Topics:
 Independent Data Mart Architecture
 Kimball Architecture
 Inmon Architecture (Hub-and-Spoke Corporate Information Factory)
 Key differences between Inmon and Kimball architectures
Independent Data
Mart Architecture
 Data is deployed on a departmental
basis without concern to sharing and
integrating information across the
enterprise.
 It is not recommended, but this approach
is prevalent, especially in large
organizations.
 It’s the path of least resistance for fast
development at relatively low cost, at
least in the short run.
 Multiple uncoordinated extracts from
the same operational sources and
redundant storage of analytic data are
inefficient and wasteful in the long run.
Kimball Architecture
 There are four separate and distinct components to consider in the DW/BI
environment:
 Operational source systems (OLTP)
 ETL system
 Data presentation area (Dimensional Model)
 Business intelligence applications.
Kimball Architecture (contd.)
 Operational Source Systems:
 These are the operational systems of record that capture the business’s
transactions (OLTP)
 Extract, Transformation, and Load System:
 Extraction is the first step in the process of getting data into the data warehouse
environment
 There are numerous potential transformations, such as:
 Cleansing the data, e.g. converting to standard date formats or correcting misspellings.
 Combining data from multiple sources and de-duplicating data.
 The primary mission of the ETL system is to hand off the dimension and fact tables in
the delivery step, so the subsystems that do this are critical.
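The transformations listed above can be sketched in a few lines. The two source lists, the date formats and the spelling-correction table are all assumptions made for this example, and a plain list stands in for the warehouse table.

```python
from datetime import datetime

# Extract: hypothetical rows pulled from two source systems.
source_a = [("C001", "15/01/2013", "Mumbay")]
source_b = [("C001", "2013-01-15", "Mumbai"), ("C002", "2013-02-01", "Pune")]

SPELLING_FIXES = {"Mumbay": "Mumbai"}      # assumed correction table
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")    # assumed source date formats

def to_iso_date(text):
    """Convert a date string in any known source format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def transform(rows):
    """Cleanse each row, then de-duplicate while keeping first-seen order."""
    cleaned = [
        (cid, to_iso_date(d), SPELLING_FIXES.get(city, city)) for cid, d, city in rows
    ]
    return list(dict.fromkeys(cleaned))

# Load: a list stands in for the warehouse table here.
warehouse_rows = transform(source_a + source_b)
print(warehouse_rows)
```

Once the date and the spelling are standardized, the two C001 rows become identical and collapse into one.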
Kimball Architecture (contd.)
 Data presentation area:
 The dimensional model, organized as a star schema, which is the output of the whole
ETL design.
 Business Intelligence Applications:
 By definition, all BI applications query the data in the DW/BI presentation area and
present it in a more meaningful and descriptive form using reporting
techniques such as pie charts, bar graphs and cross tabs.
Hub and spoke
Corporate Information
Factory (Inmon)
 This model identifies the key subject areas,
and most importantly, the key entities the
business operates with and cares about, like
customer, product, vendor, etc.
 A detailed logical model is created for each
major entity. Ex: a logical model will be built
for Customer with all the details related to
that entity.
 The key point here is that the entity structure
is built in normalized form (3NF), so data
redundancy is avoided as much as possible.
 Data marts specific for departments are built
on top of the 3NF model and the data marts
can have de-normalized data to help with
reporting/ fast querying.
Kimball vs Inmon Architectures
Characteristics | Favours Kimball | Favours Inmon
Business decision support requirements | Tactical | Strategic
Data integration requirements | Individual business requirements | Enterprise-wide integration
The structure of data | KPI, business performance measures, scorecards… | Data that meet multiple and varied information needs and non-metric data
Persistence of data in source systems | Source systems are quite stable | Source systems have high rate of change
Skill sets | Small team of generalists | Bigger team of specialists
Time constraint | Urgent needs for the first data warehouse | Longer time is allowed to meet business needs
Cost to build | Low start-up cost | High start-up costs
Thank You!

More Related Content

PPTX
Database Concepts and Terminologies
PDF
International Refereed Journal of Engineering and Science (IRJES)
PPTX
Fact table design for data ware house
DOC
Difference between ER-Modeling and Dimensional Modeling
PPT
E-R vs Starschema
PPS
ความรู้เบื้องต้นฐานข้อมูล 1
PPTX
Data resource management and DSS
DOCX
Database Concepts and Terminologies
International Refereed Journal of Engineering and Science (IRJES)
Fact table design for data ware house
Difference between ER-Modeling and Dimensional Modeling
E-R vs Starschema
ความรู้เบื้องต้นฐานข้อมูล 1
Data resource management and DSS

What's hot (20)

PDF
Difference between fact tables and dimension tables
PPTX
Data modeling dbms
PPT
Introduction to Database Concepts
PDF
Building a data warehouse
PPTX
Databases and its representation
PPTX
Fact table facts
PPT
Dimensional Modeling
PPTX
PPT
Data processing
PPTX
Database and types of database
PPTX
Software Programs for Data Analysis
PPTX
Design approach
PPT
Data Processing-Presentation
PPTX
Data Modeling Basics
PPTX
PPTX
Intro To DataBase
PPTX
Database Basics
PPT
Star schema PPT
PPT
Database intro
PPT
Week 1 Before the Advent of Database Systems & Fundamental Concepts
Difference between fact tables and dimension tables
Data modeling dbms
Introduction to Database Concepts
Building a data warehouse
Databases and its representation
Fact table facts
Dimensional Modeling
Data processing
Database and types of database
Software Programs for Data Analysis
Design approach
Data Processing-Presentation
Data Modeling Basics
Intro To DataBase
Database Basics
Star schema PPT
Database intro
Week 1 Before the Advent of Database Systems & Fundamental Concepts
Ad

Similar to Introduction to Data Warehousing (20)

PPT
Msbi by quontra us
PPT
Introduction To Msbi By Yasir
DOCX
Data miningvs datawarehouse
PDF
Cs437 lecture 1-6
PPT
Data Warehousing Datamining Concepts
PPT
2. olap warehouse
PPTX
Data Management
PPTX
Data warehouse
PPT
Gulabs Ppt On Data Warehousing And Mining
PPTX
DATAWAREHOUSE MAIn under data mining for
PPTX
Data warehouse physical design
PPTX
1-Data Warehousing-Multi Dim Data Model.pptx
PPT
Difference between data warehouse and data mining
PPTX
Dataware house multidimensionalmodelling
PPTX
INFORMATICA EASY LEARNING ONLINE TRAINING
PPT
1-_Intro_to_Data_Minning__DWH.ppt
PPT
Chapter29.ppt
PPTX
Online analytical processing
PDF
Overview of business intelligence
PPTX
Introduction to Datawarehousing.
Msbi by quontra us
Introduction To Msbi By Yasir
Data miningvs datawarehouse
Cs437 lecture 1-6
Data Warehousing Datamining Concepts
2. olap warehouse
Data Management
Data warehouse
Gulabs Ppt On Data Warehousing And Mining
DATAWAREHOUSE MAIn under data mining for
Data warehouse physical design
1-Data Warehousing-Multi Dim Data Model.pptx
Difference between data warehouse and data mining
Dataware house multidimensionalmodelling
INFORMATICA EASY LEARNING ONLINE TRAINING
1-_Intro_to_Data_Minning__DWH.ppt
Chapter29.ppt
Online analytical processing
Overview of business intelligence
Introduction to Datawarehousing.
Ad

Recently uploaded (20)

PDF
Introduction to Business Data Analytics.
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
Supervised vs unsupervised machine learning algorithms
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
Lecture1 pattern recognition............
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPT
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
Launch Your Data Science Career in Kochi – 2025
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPT
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
PPT
Reliability_Chapter_ presentation 1221.5784
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
Database Infoormation System (DBIS).pptx
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
Business Acumen Training GuidePresentation.pptx
Introduction to Business Data Analytics.
Introduction to Knowledge Engineering Part 1
Supervised vs unsupervised machine learning algorithms
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Lecture1 pattern recognition............
IBA_Chapter_11_Slides_Final_Accessible.pptx
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Launch Your Data Science Career in Kochi – 2025
Business Ppt On Nestle.pptx huunnnhhgfvu
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
Reliability_Chapter_ presentation 1221.5784
Miokarditis (Inflamasi pada Otot Jantung)
Database Infoormation System (DBIS).pptx
Clinical guidelines as a resource for EBP(1).pdf
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
Business Acumen Training GuidePresentation.pptx

Introduction to Data Warehousing

  • 2. In the beginning..  In the early 80s the concept of RDBMS ushered in an era of improved access to the valuable information contained deep within data  Our need for information grew exponentially and we needed a solution for an efficient Decision Supporting System to run the business(OLAP)  With the growing data OLTP systems became inefficient and not optimal for complex query processing, reporting and analytical need
  • 3. In the beginning..  In the early 70s Bill Inmon also famous as “The Father of Data Warehouse” coined the term “Data Warehouse”  According to Inmon “Data Warehouse is a collection of integrated, subject- oriented databases designed to support a DSS (decision support system), where each unit of data is non-volatile and relevant to some moment in time”
  • 4. These concerns have existed for more than three decades-  “We collect tons of data, but we can’t access it.”  “We need to slice and dice the data every which way.”  “Business people need to get at the data easily.”  “Just show me what is important.”  “We spend entire meetings arguing about who has the right numbers rather than making decisions.”  “We want people to use information to support more fact-based decision making.”
  • 5. Why we need DWH?  Consolidation of information resources across multiples platforms and even geographies at a single premise by extracting(E), transforming(T) and finally loading(L)  Improved query performance  Foundation of Data Mining, Data Visualization, Advanced reporting using BI, OLAP tools
  • 7. What is DWH used for?  Information and Knowledge:  Intelligent Reporting and analytics  Studying the trends and nature of business over time  Predicting the future with the help of data  Help business take key decisions and make strategies  Manage the history of transactions happened within an organization Examples:
  • 8. What is DWH used for? Examples  E-commerce Providing cross sales, suggesting different products like AMAZON providing different features like ‘People also bought/viewed’. Companies often use customer data to predict or manipulate the customer requirements considering all the parameters like gender, geography, age etc
  • 9. What is DWH used for? Examples  Supermarkets One notable recent example of this was with the US retailer Target. As part of its Data Mining program, the company developed rules to predict if their shoppers were likely to be pregnant. By looking at the contents of their customers shopping baskets, they could spot customers who they thought were likely to be expecting and begin targeting promotions for nappies, cotton wool and so on. The prediction was so accurate that Target made the news by sending promotional coupons to families who did not yet realize they were pregnant!
  • 10. What is DWH used for? Examples  Crime Agencies (where is crime most likely to happen and when?) Crime prevention agencies use analytics and Data Mining to spot trends across myriads of data helping with everything from where to deploy police manpower.
  • 11. What is DWH used for? Examples  Service Providers- Mobile phone and utilities companies use Data Mining and Business Intelligence to predict ‘churn’, the terms they use for when a customer leaves their company to get their phone/gas/broadband from another provider. They collate billing information, customer services interactions, website visits and other metrics to give each customer a probability score, then target offers and incentives to customers whom they perceive to be at a higher risk of churning.
  • 12. Responsibilities of DW/BI Managers:  Understand the business users and determine the decisions that the business users want to make with the help of the DW/BI system.  Deliver high-quality, relevant, and accessible information and analytics to the business users:  Produce robust, presentable and meaningful data  Continuously monitor the accuracy of the data and analyses  Adapt to changing user profiles, requirements, and business priorities, along with the availability of new data sources  Sustain the DW/BI environment by updating the DW/BI system on a regular basis and justify staffing and on-going expenditures
  • 13. Pre-requisites before diving in…  Types of Databases:  OLTP  OLAP  What is Normalization?  Types (1NF, 2NF, 3NF)  Keys (Primary, Foreign, Composite, Surrogate)  Data Modelling  Conceptual  Logical  Physical  E-R models  Dimensional Modelling  Star Schema  Fact Tables  Dimension tables  Slowly changing dimensions (SCD – I, SCD – II, SCD – III)
  • 14. OLTP and OLAP Databases  Types of Databases:  OLTP - Online transactional processing  OLAP - Online analytical processing  One of the most important assets of any organization is its information. This asset is almost always used for two purposes: operational record keeping and analytical decision making.  Simply speaking, the operational systems (OLTP) are where you put the data in, and the DW/BI system (OLAP) is where you get the data out.
  • 15. OLTP vs OLAP OLTP  OLTPs are the original source of the data.  To control and run fundamental business tasks  Reveals a snapshot of ongoing business processes  Highly normalized with many tables  Low in space  Volatile OLAP  OLAP data comes from the various OLTP Databases  To help with planning, problem solving, and decision support  Multi-dimensional views of various kinds of business activities  De-normalized, star schema  High space  Non-volatile
  • 16. What is Normalization?  Normalization is a database design technique which organizes tables in a manner that reduces redundancy and dependency of data.  It divides larger tables to smaller tables and links them using relationships.  There are 3 prominent types of normalization but these techniques are still evolving and we have totally 7 normalization techniques  1NF  2NF  3NF
  • 17. Types of Normalization: 1NF  Rules:  Each table cell should contain a single value.  Each record needs to be unique
  • 18. Types of Normalization: 2NF  Rules:  The entity should be in 1NF  All attributes within the entity should depend solely on the unique identifier of the entity
  • 19. Types of Normalization: 3NF  Rules:  The entity should be in 2NF  No column entry should be dependent on any other entry (value) other than the key for the table  Example tables shown on the slide:  Product table  Brand table
  • 20. Types of Normalization: 3NF  Product brand table  ER model:
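The slide images for the product and brand tables are not available here, so the following is only a minimal sketch of the 3NF split, written with Python's built-in `sqlite3` module. The table and column names (`product`, `brand`, `brand_country`) and the sample rows are assumptions made for illustration, not the author's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Brand attributes depend only on the brand, so they live in their own table.
conn.execute("""
    CREATE TABLE brand (
        brand_id      INTEGER PRIMARY KEY,
        brand_name    TEXT NOT NULL UNIQUE,
        brand_country TEXT NOT NULL
    )""")

# Product rows reference the brand by key instead of repeating its attributes.
conn.execute("""
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        brand_id     INTEGER NOT NULL REFERENCES brand(brand_id)
    )""")

conn.executemany("INSERT INTO brand VALUES (?, ?, ?)",
                 [(1, "Acme", "US"), (2, "Globex", "DE")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(10, "Anvil", 1), (11, "Rocket", 1), (12, "Widget", 2)])

# A brand detail is changed in exactly one place...
conn.execute("UPDATE brand SET brand_country = 'CA' WHERE brand_id = 1")

# ...and every product of that brand sees the change through the join.
rows = conn.execute("""
    SELECT p.product_name, b.brand_name, b.brand_country
    FROM product p JOIN brand b ON b.brand_id = p.brand_id
    ORDER BY p.product_id""").fetchall()
```

Because the brand's country is stored once, the update cannot leave two products of the same brand disagreeing about it, which is the redundancy that 3NF is meant to remove.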
  • 21. Different types of keys -  Primary Key - It identifies each record uniquely in a table. A primary key does not allow null values in the column and keeps values unique throughout the column.  Foreign Key - In a relationship between two tables, the primary key of one table is referenced as a foreign key in another table. A foreign key can have duplicate values and can also hold null values if the column is defined to accept nulls.  Candidate Key - It can be selected as the primary key of the table. A table can have multiple candidate keys, out of which one is selected as the primary key.  Unique Key - It can contain only unique values, but it also permits NULL values.  Alternate Key - It is a candidate key that was not selected as the primary key of the table.  Composite Key (also known as compound key or concatenated key) - It is a group of two or more columns that identifies each row of a table uniquely. An individual column of a composite key might not be able to uniquely identify the record. It can be a primary key or a candidate key.  Super Key - It is a set of columns that uniquely identifies each row in a table. A super key may hold some additional columns which are not strictly required to uniquely identify each row. Primary keys and candidate keys are minimal super keys, that is, a subset of the super keys.
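To make the key types above concrete, here is a small sketch using Python's built-in `sqlite3` module. The tables `customer` and `order_line`, their columns, and the helper `violates` are hypothetical names chosen for this illustration; other databases may differ in details such as how NULLs are treated in unique columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary key
        email       TEXT UNIQUE            -- unique key: no duplicates, NULL allowed
    );
    CREATE TABLE order_line (
        order_id    INTEGER NOT NULL,
        line_no     INTEGER NOT NULL,
        customer_id INTEGER REFERENCES customer(customer_id),  -- foreign key
        PRIMARY KEY (order_id, line_no)    -- composite key
    );
""")

conn.execute("INSERT INTO customer VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO customer VALUES (2, NULL)")
conn.execute("INSERT INTO order_line VALUES (100, 1, 1)")
conn.execute("INSERT INTO order_line VALUES (100, 2, 1)")  # same order_id, new line_no


def violates(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


duplicate_primary = violates("INSERT INTO customer VALUES (1, 'b@example.com')")
duplicate_unique = violates("INSERT INTO customer VALUES (3, 'a@example.com')")
duplicate_composite = violates("INSERT INTO order_line VALUES (100, 1, 1)")
missing_parent = violates("INSERT INTO order_line VALUES (101, 1, 999)")
null_foreign_key = violates("INSERT INTO order_line VALUES (102, 1, NULL)")
```

The first four inserts are rejected, while the last one is accepted because the foreign key column was declared without `NOT NULL`.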
  • 23. Data Modelling  A data model is a conceptual representation of the database tables/objects which depicts the blueprint of the whole schema.  There are 3 types of data model, listed here from the most abstract to the most detailed:  Conceptual  Logical  Physical
  • 24. Data Modelling (Conceptual Model)  The main aim of this model is to establish the entities, their attributes, and their relationships. It contains very few details of the actual database structure.  The 3 basic tenets of a data model are  Entity: A real-world thing (e.g. Customer, Product)  Attribute: Characteristics or properties of an entity  Relationship: Dependency or association between two entities.
  • 25. Data Modelling (Logical Model)  It defines the structure of the data elements and sets the relationships between them.  The advantage of the logical data model is that it provides a foundation for the physical model.  At this data modelling level, no primary or secondary key is defined.
  • 26. Data Modelling (Physical Model)  It describes the database-specific implementation of the data model.  It offers an abstraction of the database and helps generate the schema.  This type of data model also helps to visualize the database structure.  It helps to model database columns, keys, constraints, indexes, triggers, and other RDBMS features.
  • 27. Entity Relationship Models/Diagrams  Entity-relationship diagrams (ER diagrams or ERDs) are drawings that communicate the relationships between tables.  It is a type of flowchart that illustrates how “entities” such as people, objects or concepts relate to each other within a system.  ER Diagrams are most often used to design or debug relational databases.  Also known as ERDs or ER Models, they use a defined set of symbols such as rectangles, diamonds, ovals and connecting lines to depict the interconnectedness of entities, relationships and their attributes.
  • 29. Dimensional Modelling  Dimensional modeling is widely accepted as the preferred technique for presenting analytic data because it addresses two simultaneous requirements:  Deliver data that’s understandable to the business users.  Deliver fast query performance.  Both 3NF and dimensional models can be represented in ERDs because both consist of joined relational tables; the key difference between 3NF and dimensional models is the degree of normalization.  Normalized 3NF structures divide data into many discrete entities, each of which becomes a relational table. A database of sales orders might start with a record for each order line but turn into a complex spider web diagram as a 3NF model, perhaps consisting of hundreds of normalized tables.
  • 31. Dimensional Modelling (contd.)  Normalized 3NF structures are immensely useful in operational processing (OLTP) because an update or insert transaction touches the database in only one place.  Normalized models, however, are too complicated for BI queries. Users can’t understand, navigate, or remember normalized models that resemble a map of a city.  The complexity of users’ unpredictable queries overwhelms the database optimizers, resulting in disastrous query performance.  Fortunately, dimensional modeling addresses the problem of overly complex schemas in the presentation area.  A dimensional model contains the same information as a normalized model, but packages the data in a format that delivers user understandability, query performance, and resilience to change.
  • 32. Dimensional Modelling – Star Schema  Dimensional models implemented in relational database management systems are referred to as star schemas because of their resemblance to a star-like structure.  The downside of dimensional modelling is that you pay a load performance price for these capabilities, especially with large data sets.
  • 33. Dimensional Modelling – Fact Tables  The fact table in a dimensional model stores the performance measurements resulting from an organization’s business process events.  You should strive to store the low-level measurement data resulting from a business process in a single dimensional model (grain ex: "balls per innings").  Imagine standing in the marketplace watching products being sold and writing down the unit quantity and dollar sales amount for each product in each sales transaction.
  • 34. Dimensional Modelling – Fact Tables (contd.)  Types of facts: additive, semi-additive, non-additive  Additive - Additive measures can be summed across any of the dimensions associated with the fact table (e.g. sales amount).  Semi-additive - They can be summed across some dimensions, but not all; balance amounts are common semi-additive facts because they are additive across all dimensions except time.  Non-additive - Some measures are completely non-additive, such as ratios, percentages and percentiles.  Despite their sparsity, fact tables usually make up 90 percent or more of the total space consumed by a dimensional model.  Fact tables tend to be deep in terms of the number of rows, but narrow in terms of the number of columns.
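The difference between the three kinds of fact can be checked with a few lines of arithmetic. The sketch below uses made-up account snapshots and sales lines (all values and field layouts are assumptions for illustration) to show which aggregations give a meaningful number.

```python
# Hypothetical daily snapshot rows: (date, account, balance, deposits_that_day)
snapshots = [
    ("2013-01-01", "A", 100.0, 100.0),
    ("2013-01-01", "B", 50.0, 50.0),
    ("2013-01-02", "A", 120.0, 20.0),
    ("2013-01-02", "B", 50.0, 0.0),
]

# Additive fact: deposits can be summed across every dimension, including time.
total_deposits = sum(row[3] for row in snapshots)

# Semi-additive fact: balances can be summed across accounts for a single date...
balance_on_jan_2 = sum(r[2] for r in snapshots if r[0] == "2013-01-02")

# ...but summing one account's balances across dates counts the same money twice.
naive_sum_over_time = sum(r[2] for r in snapshots if r[1] == "A")

# Across time, use the period-end balance (or an average) instead.
last_date = max(r[0] for r in snapshots)
closing_balance_a = next(r[2] for r in snapshots
                         if r[1] == "A" and r[0] == last_date)

# Non-additive fact: a ratio must be recomputed from its additive components.
sales_lines = [(100.0, 80.0), (300.0, 150.0)]  # (revenue, cost)
line_margins = [(rev - cost) / rev for rev, cost in sales_lines]
wrong_margin = sum(line_margins) / len(line_margins)  # averaging ratios
total_revenue = sum(rev for rev, _ in sales_lines)
total_cost = sum(cost for _, cost in sales_lines)
overall_margin = (total_revenue - total_cost) / total_revenue
```

Here `naive_sum_over_time` comes out at 220 even though account A never held more than 120, and the averaged margin (0.35) differs from the margin computed from total revenue and cost (0.425). This is why ratios are usually stored as their additive numerator and denominator.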
  • 35. Dimensional Modelling – Dimension Tables  The dimension tables contain the textual context associated with a business process measurement event.  They describe the “who, what, where, when, how, and why” associated with the event.  Dimension tables often have many columns or attributes.  Dimension tables tend to have fewer rows than fact tables, but can be wide with many large text columns.  Each dimension is defined by a single primary key, which serves as the basis for referential integrity with any given fact table to which it is joined.
  • 36. Dimensional Modelling – Dimension Tables  Dimension attributes serve as the primary source of query constraints, groupings, and report labels.  Dimension attributes are critical to making the DW/BI system usable and understandable.  The analytic power of the DW/BI environment is directly proportional to the quality and depth of the dimension attributes.  Instead of third normal form, dimension tables typically are highly denormalized with flattened many-to-one relationships within a single dimension table.  We can almost always trade off dimension table space for simplicity and accessibility.
  • 37. Facts and Dimensions joined in a star schema  The first thing to notice about the dimensional schema is its simplicity and symmetry.  The charm of the design is that it is highly recognizable to business users.  Furthermore, the reduced number of tables and the use of meaningful business descriptors make it easy to navigate and less likely that mistakes will occur.  The simplicity of a dimensional model also has performance benefits.  Database optimizers process these simple schemas with fewer joins more efficiently.  Dimension attributes supply the report filters and labeling, whereas the fact tables supply the report’s numeric values.
  • 38. Facts and Dimensions joined in a star schema
      SELECT st.district_name,
             pd.brand,
             SUM(sf.sales_dollars) AS "Sales Dollars"
      FROM store st,        -- dimension table
           product pd,      -- dimension table
           date dt,         -- dimension table (quote or rename it if DATE is reserved in your RDBMS)
           sales_facts sf   -- fact table
      WHERE dt.month_name = 'January'   -- string literals take single quotes
        AND dt.year = 2013
        AND st.store_key = sf.store_key
        AND pd.product_key = sf.product_key
        AND dt.date_key = sf.date_key
      GROUP BY st.district_name, pd.brand
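To try the star-schema query end to end, the sketch below builds a tiny version of the schema in an in-memory SQLite database via Python's `sqlite3` module. The sample rows are invented, the date dimension is named `date_dim` here (an assumption, chosen because `DATE` is a reserved word in some databases), and the joins are written with explicit `JOIN ... ON` syntax, which is equivalent to the comma-style joins on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store    (store_key INTEGER PRIMARY KEY, district_name TEXT);
    CREATE TABLE product  (product_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month_name TEXT, year INTEGER);
    CREATE TABLE sales_facts (
        store_key     INTEGER REFERENCES store(store_key),
        product_key   INTEGER REFERENCES product(product_key),
        date_key      INTEGER REFERENCES date_dim(date_key),
        sales_dollars REAL
    );
    INSERT INTO store VALUES (1, 'North'), (2, 'South');
    INSERT INTO product VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO date_dim VALUES (20130115, 'January', 2013),
                                (20130215, 'February', 2013);
    INSERT INTO sales_facts VALUES
        (1, 1, 20130115, 100.0),
        (1, 1, 20130115, 50.0),
        (2, 2, 20130115, 75.0),
        (1, 2, 20130215, 999.0);  -- February row, filtered out below
""")

# Dimensions supply the filters and labels; the fact table supplies the numbers.
rows = conn.execute("""
    SELECT st.district_name, pd.brand, SUM(sf.sales_dollars) AS sales_dollars
    FROM sales_facts sf
    JOIN store st    ON st.store_key = sf.store_key
    JOIN product pd  ON pd.product_key = sf.product_key
    JOIN date_dim dt ON dt.date_key = sf.date_key
    WHERE dt.month_name = 'January' AND dt.year = 2013
    GROUP BY st.district_name, pd.brand
    ORDER BY st.district_name, pd.brand
""").fetchall()
```

Every join goes from the fact table directly to one dimension, so the query shape stays the same no matter which dimension attributes a report filters or groups by.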
  • 39. Slowly changing dimensions  A slowly changing dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. Handling SCDs is considered one of the most critical ETL tasks in tracking the history of dimension records.  Three types of SCD are commonly used, and a tool such as Oracle Warehouse Builder can be used to define, deploy, and load all three types:  Type 1 SCDs - Overwriting  Type 2 SCDs - Creating another dimension record  Type 3 SCDs - Creating a current value field
  • 40. Slowly Changing Dimensions (contd.)  Type 1 - Overwriting the old value. In this method no history of dimension changes is kept in the database. The old dimension value is simply overwritten by the new one.
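A Type 1 change amounts to a plain `UPDATE`. The sketch below shows this with Python's `sqlite3` module; the `customer_dim` table, its columns, and the function name `apply_type1` are hypothetical and serve only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key
        customer_id  TEXT NOT NULL,        -- natural (business) key
        city         TEXT NOT NULL
    )""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'C001', 'Pune')")


def apply_type1(conn, customer_id, new_city):
    """Type 1: overwrite the attribute in place; the previous value is lost."""
    conn.execute(
        "UPDATE customer_dim SET city = ? WHERE customer_id = ?",
        (new_city, customer_id),
    )


apply_type1(conn, "C001", "Mumbai")
rows = conn.execute(
    "SELECT customer_key, customer_id, city FROM customer_dim"
).fetchall()
```

After the update there is still one row with the same surrogate key, and nothing in the table records that the customer was ever in Pune.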
  • 41. Slowly Changing Dimensions (contd.)  Type 2 - Creating a new additional record. In this methodology all history of dimension changes is kept in the database. Changes in the attributes are captured by adding a new row with a new surrogate key to the dimension table.
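One possible way to implement a Type 2 change is sketched below with Python's `sqlite3` module. The slide only requires a new row with a new surrogate key; the `valid_from`, `valid_to` and `is_current` columns are a common convention added here as an assumption, and the table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  TEXT NOT NULL,                      -- natural (business) key
        city         TEXT NOT NULL,
        valid_from   TEXT NOT NULL,
        valid_to     TEXT,                               -- NULL while the row is current
        is_current   INTEGER NOT NULL
    )""")
conn.execute(
    "INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
    "VALUES ('C001', 'Pune', '2012-01-01', NULL, 1)"
)


def apply_type2(conn, customer_id, new_city, change_date):
    """Type 2: expire the current row, then add a row with a new surrogate key."""
    current = conn.execute(
        "SELECT customer_key, city FROM customer_dim "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == new_city:
        return  # attribute unchanged, nothing to record
    if current is not None:
        conn.execute(
            "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
            "WHERE customer_key = ?",
            (change_date, current[0]),
        )
    conn.execute(
        "INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )


apply_type2(conn, "C001", "Mumbai", "2013-06-01")
history = conn.execute(
    "SELECT customer_key, city, valid_from, valid_to, is_current "
    "FROM customer_dim ORDER BY customer_key"
).fetchall()
```

Fact rows loaded before the change keep pointing at surrogate key 1 (Pune) and later facts point at key 2 (Mumbai), so historical reports stay consistent with what was true at the time.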
  • 42. Slowly Changing Dimensions (contd.)  Type 3 - Adding a new column. In this type usually only the current and previous values of the dimension attribute are kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.
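The Type 3 approach can be sketched with one extra column, again using Python's `sqlite3` module. The column names `current_city` and `previous_city` and the function `apply_type3` are assumptions chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key  INTEGER PRIMARY KEY,
        customer_id   TEXT NOT NULL,
        current_city  TEXT NOT NULL,
        previous_city TEXT               -- NULL until the first change
    )""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'C001', 'Pune', NULL)")


def apply_type3(conn, customer_id, new_city):
    """Type 3: shift the current value into the 'previous' column, then overwrite."""
    # Both assignments read the row's old values, so previous_city receives
    # the city that was current before this statement ran.
    conn.execute(
        "UPDATE customer_dim "
        "SET previous_city = current_city, current_city = ? "
        "WHERE customer_id = ?",
        (new_city, customer_id),
    )


apply_type3(conn, "C001", "Mumbai")
after_first = conn.execute(
    "SELECT current_city, previous_city FROM customer_dim").fetchone()

apply_type3(conn, "C001", "Delhi")
after_second = conn.execute(
    "SELECT current_city, previous_city FROM customer_dim").fetchone()
```

After the second change the original city (Pune) is gone, which illustrates the point that the history is limited by the number of columns set aside for it.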
  • 43. Data Marts  A data mart is focused on a single functional area of an organization and contains a subset of data stored in a Data Warehouse.  A data mart is a condensed version of Data Warehouse and is designed for use by a specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR or finance.  It is often controlled by a single department in an organization.  Data Mart usually draws data from only a few sources compared to a Data warehouse.  Data marts are small in size and are more flexible compared to a Data warehouse.
  • 44. A glimpse of Data Warehouse Architecture
  • 48. Data Warehouse Architectures (In Detail)  Topics:  Independent Data Mart Architecture  Kimball Architecture  Inmon Architecture (Hub-and-Spoke Corporate Information Factory)  Key differences between Inmon and Kimball architectures
  • 49. Independent Data Mart Architecture  Data is deployed on a departmental basis without concern for sharing and integrating information across the enterprise.  It is not recommended, but this approach is prevalent, especially in large organizations.  It’s the path of least resistance for fast development at relatively low cost, at least in the short run.  Multiple uncoordinated extracts from the same operational sources and redundant storage of analytic data are inefficient and wasteful in the long run.
  • 50. Kimball Architecture  There are four separate and distinct components to consider in the DW/BI environment:  Operational source systems (OLTP)  ETL system  Data presentation area (Dimensional Model)  Business intelligence applications.
  • 52. Kimball Architecture (contd.)  Operational Source Systems:  These are the operational systems of record that capture the business’s transactions (OLTP)  Extract, Transformation, and Load System:  Extraction is the first step in the process of getting data into the data warehouse environment  There are numerous potential transformations, such as:  Cleansing the data, e.g. converting to standard date formats or correcting misspellings.  Combining data from multiple sources, and de-duplicating data.  The primary mission of the ETL system is to hand off the dimension and fact tables in the delivery step, so the subsystems that do this are critical.
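The transformations named above (standardizing dates, correcting misspellings, combining sources and de-duplicating) can be sketched in a few lines of plain Python. The input rows, the accepted date formats, and the spelling lookup are all invented for this illustration; a real ETL system would drive them from profiling results and reference data.

```python
from datetime import datetime

# Hypothetical raw rows extracted from two source systems.
raw_rows = [
    {"customer_id": "C001", "city": "Mumbay", "signup": "15/01/2013"},
    {"customer_id": "C001", "city": "Mumbai", "signup": "2013-01-15"},
    {"customer_id": "C002", "city": "Pune", "signup": "20.01.2013"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")  # assumed source formats
SPELLING_FIXES = {"Mumbay": "Mumbai"}                # assumed correction lookup


def standardize_date(text):
    """Cleansing: convert any accepted source format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(rows):
    """Cleanse each row, then drop rows that became exact duplicates."""
    seen = set()
    clean = []
    for row in rows:
        record = (
            row["customer_id"],
            SPELLING_FIXES.get(row["city"], row["city"]),
            standardize_date(row["signup"]),
        )
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


clean_rows = transform(raw_rows)
```

The first two source rows only turn out to be duplicates after cleansing, so the order of the steps matters: de-duplicating before the date and spelling fixes would have kept both.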
  • 53. Kimball Architecture (contd.)  Data presentation area:  This holds the dimensional model with the star schema, which is the end product of the whole ETL design.  Business Intelligence Applications:  By definition, all BI applications query the data in the DW/BI presentation area and present it in a more meaningful and descriptive form using reporting techniques such as pie charts, bar graphs, cross tabs, etc.
  • 54. Hub and spoke Corporate Information Factory (Inmon)  This model identifies the key subject areas and, most importantly, the key entities the business operates with and cares about, like customer, product, vendor, etc.  A detailed logical model is created for each major entity. Ex: A logical model will be built for Customer with all the details related to that entity.  The key point here is that the entity structure is built in normalized form (3NF), so data redundancy is avoided as much as possible.  Data marts specific to departments are built on top of the 3NF model, and the data marts can hold de-normalized data to help with reporting and fast querying.
  • 55. Kimball vs Inmon Architectures
      Characteristics | Favours Kimball | Favours Inmon
      Business decision support requirements | Tactical | Strategic
      Data integration requirements | Individual business requirements | Enterprise-wide integration
      The structure of data | KPI, business performance measures, scorecards… | Data that meet multiple and varied information needs and non-metric data
      Persistence of data in source systems | Source systems are quite stable | Source systems have high rate of change
      Skill sets | Small team of generalists | Bigger team of specialists
      Time constraint | Urgent needs for the first data warehouse | Longer time is allowed to meet business needs
      Cost to build | Low start-up cost | High start-up costs
  • 56. References  The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition, Ralph Kimball and Margy Ross  Introduction to Data Warehousing Concepts (docs.oracle.com)  A Short History of Data Warehousing, DATAVERSITY (www.dataversity.net)  Data integration, Wikipedia (en.wikipedia.org)  Data Warehouse Design: The Good, the Bad, the Ugly (blog.panoply.io)  ER Model Basic Concepts (www.tutorialspoint.com)  Oracle ATG Web Commerce - Search ERD (docs.oracle.com)  Document Entities, Oracle Magazine (blogs.oracle.com)