Introduction to Data Warehousing

Presented by Animesh Srivastava

In the beginning..
 In the early 80s, relational database management systems (RDBMS) ushered in an era of
improved access to the valuable information contained deep within data.
 Our need for information grew exponentially, and we needed an efficient
decision support system (DSS) to run the business (OLAP).
 As data volumes grew, OLTP systems became inefficient for
complex query processing, reporting, and analytical needs.

In the beginning.. (contd.)
 In the early 70s, Bill Inmon, also known as “The Father of the Data Warehouse”,
is credited with coining the term “Data Warehouse”.
 According to Inmon, a data warehouse is “a collection of integrated, subject-
oriented databases designed to support a DSS (decision support system),
where each unit of data is non-volatile and relevant to some moment in
time”.

These concerns have existed for more than three decades:
 “We collect tons of data, but we can’t
access it.”
 “We need to slice and dice the data
every which way.”
 “Business people need to get at the
data easily.”
 “Just show me what is important.”
 “We spend entire meetings arguing
about who has the right numbers rather
than making decisions.”
 “We want people to use information to
support more fact-based decision
making.”

Why do we need a DWH?
 Consolidation of information
resources across multiple
platforms and even geographies in
a single place by extracting (E),
transforming (T) and finally
loading (L) the data
 Improved query performance
 Foundation of Data Mining, Data
Visualization, Advanced reporting
using BI, OLAP tools


What is DWH
used for?
 Information and Knowledge:
 Intelligent Reporting and analytics
 Studying the trends and nature of
business over time
 Predicting the future with the help of
data
 Help business take key decisions and
make strategies
 Manage the history of transactions
that have happened within an organization
Examples:

What is DWH used
for?
Examples
 E-commerce
Cross-selling and product suggestions:
for example, Amazon offers
features such as
‘People also bought/viewed’.
Companies often use customer data
to predict or influence
customer requirements, considering
parameters such as gender,
geography, and age.

What is DWH used
for?
Examples
 Supermarkets
One notable recent example of this was with
the US retailer Target. As part of its Data
Mining program, the company developed
rules to predict if their shoppers were likely
to be pregnant. By looking at the contents of
their customers’ shopping baskets, they could
spot customers who they thought were likely
to be expecting and begin targeting
promotions for nappies, cotton wool and so
on. The prediction was so accurate that
Target made the news by sending
promotional coupons to families who did not
yet realize they were pregnant!

What is DWH used for?
Examples
 Crime Agencies (where is crime
most likely to happen and
when?)
Crime prevention agencies use
analytics and Data Mining to spot
trends across large volumes of data,
helping with decisions such as
where to deploy police manpower.

What is DWH used
for?
Examples
 Service Providers-
Mobile phone and utilities companies
use Data Mining and Business
Intelligence to predict ‘churn’, the
term they use for when a customer
leaves their company to get their
phone/gas/broadband from another
provider. They collate billing
information, customer services
interactions, website visits and other
metrics to give each customer a
probability score, then target offers
and incentives to customers whom they
perceive to be at a higher risk of
churning.

Responsibilities of DW/BI Managers:
 Understand the business users and determine the decisions that the business
users want to make with the help of the DW/BI system.
 Deliver high-quality, relevant, and accessible information and analytics to the
business users:
 Produce robust, presentable and meaningful data
 Continuously monitor the accuracy of the data and analyses
 Adapt to changing user profiles, requirements, and business priorities, along with
the availability of new data sources
 Sustain the DW/BI environment by updating the DW/BI system on a regular
basis and justifying staffing and ongoing expenditures

Pre-requisites before diving in…
 Types of Databases:
 OLTP
 OLAP
 What is Normalization?
 Types (1NF, 2NF, 3NF)
 Keys (Primary, Foreign, Composite, Surrogate)
 Data Modelling
 Conceptual
 Logical
 Physical
 E-R models
 Dimensional Modelling
 Star Schema
 Fact Tables
 Dimension tables
 Slowly changing dimensions (SCD – I, SCD – II, SCD – III)

OLTP and OLAP Databases
 Types of Databases:
 OLTP - Online transaction processing
 OLAP - Online analytical processing
 One of the most important assets of any organization is its information. This
asset is almost always used for two purposes: operational record keeping and
analytical decision making.
 Simply speaking, the operational systems (OLTP) are where you put the data
in, and the DW/BI system (OLAP) is where you get the data out.

OLTP vs OLAP
OLTP
 OLTPs are the original source of
the data.
 To control and run fundamental
business tasks
 Reveals a snapshot of ongoing
business processes
 Highly normalized with many
tables
 Relatively low storage requirements
 Volatile
OLAP
 OLAP data comes from the various
OLTP Databases
 To help with planning, problem
solving, and decision support
 Multi-dimensional views of various
kinds of business activities
 De-normalized, star schema
 Relatively high storage requirements
 Non-volatile

What is Normalization?
 Normalization is a database design technique that organizes tables in a
manner that reduces data redundancy and undesirable dependencies.
 It divides larger tables into smaller tables and links them using relationships.
 The 3 most commonly used normal forms are listed below; further normal forms
have been defined beyond these (7 are commonly listed in total).
 1NF
 2NF
 3NF

Types of
Normalization:
1NF
 Rules:
 Each table cell should
contain a single
value.
 Each record needs to
be unique

Types of
Normalization:
2NF
 Rules:
 The entity should be
in 1NF
 All non-key attributes
should depend on the
whole unique identifier
(key) of the entity, not
on just a part of it

Types of Normalization:
3NF
(Figure: Product table and Brand table)
Rules:
• The entity should be in 2NF
• No non-key column should depend on any column other than the key of the
table (no transitive dependencies)
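
As a minimal sketch of the 3NF rule, assume a hypothetical flat product table in which `brand_country` depends on `brand` rather than on the key `product_id`. The table and column names here are illustrative assumptions, not taken from the slide's figures. Moving the brand attributes into their own table removes that dependency:

```python
import sqlite3

# Hypothetical un-normalized rows: (product_id, product_name, brand, brand_country).
# brand_country depends on brand, not on product_id (a transitive dependency).
FLAT_ROWS = [
    (1, "Phone X", "Acme", "USA"),
    (2, "Phone Y", "Acme", "USA"),
    (3, "Tablet Z", "Globex", "Germany"),
]


def normalize_to_3nf(rows):
    """Load flat rows into separate brand and product tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE brand (
            brand_id      INTEGER PRIMARY KEY,
            brand_name    TEXT NOT NULL UNIQUE,
            brand_country TEXT NOT NULL
        );
        CREATE TABLE product (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT NOT NULL,
            brand_id     INTEGER NOT NULL REFERENCES brand (brand_id)
        );
        """
    )
    for product_id, product_name, brand, country in rows:
        # Each brand is stored once; a repeated brand name is ignored.
        conn.execute(
            "INSERT OR IGNORE INTO brand (brand_name, brand_country) VALUES (?, ?)",
            (brand, country),
        )
        brand_id = conn.execute(
            "SELECT brand_id FROM brand WHERE brand_name = ?", (brand,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO product VALUES (?, ?, ?)",
            (product_id, product_name, brand_id),
        )
    return conn


conn = normalize_to_3nf(FLAT_ROWS)
brand_count = conn.execute("SELECT COUNT(*) FROM brand").fetchone()[0]
product_count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
```

After the split, a brand's country is stored in exactly one row, so changing it is a single-row update instead of one update per product.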

Types of Normalization:
3NF (contd.)
(Figure: Product-brand table and its ER model)

Different types of keys -
 Primary Key - Identifies each record in a table uniquely. A primary key does not allow null
values and holds unique values throughout the column.
 Foreign Key - In a relationship between two tables, a column in one table that refers to the
primary key of another table. A foreign key can hold duplicate values and can also hold
null values if the column is defined to accept nulls.
 Candidate Key - A key that could be selected as the primary key of the table. A table can have
multiple candidate keys, out of which one is selected as the primary key.
 Unique Key - Can contain only unique values, but it also permits NULL values.
 Alternate Key - A candidate key that was not selected as the primary key of the table.
 Composite Key (also known as compound key or concatenated key) - A group of two or
more columns that together identify each row of a table uniquely. An individual column of a
composite key might not be able to identify the record uniquely. It can be a primary key or a
candidate key.
 Super Key - A set of columns that uniquely identifies each row in a table. A super key may
hold some additional columns which are not strictly required to identify each row uniquely.
Primary and candidate keys are minimal super keys; in other words, they are a subset of the super keys.
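
The sketch below shows how some of these key types might be declared and enforced, using SQLite through Python's standard `sqlite3` module. The `customer` and `order_line` tables and the `is_rejected` helper are hypothetical names chosen for illustration, and other database systems may differ in detail (for example, in how many NULLs a unique column accepts).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary key
        email       TEXT UNIQUE            -- unique key (NULLs permitted)
    );
    CREATE TABLE order_line (
        order_id    INTEGER NOT NULL,
        line_no     INTEGER NOT NULL,
        customer_id INTEGER REFERENCES customer (customer_id),  -- foreign key
        PRIMARY KEY (order_id, line_no)    -- composite key
    );
    """
)
conn.execute("INSERT INTO customer VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO customer VALUES (2, NULL)")
conn.execute("INSERT INTO customer VALUES (3, NULL)")  # a second NULL is accepted
conn.execute("INSERT INTO order_line VALUES (100, 1, 1)")
conn.execute("INSERT INTO order_line VALUES (100, 2, 1)")  # same order, new line number


def is_rejected(sql):
    """Return True if the statement violates a key constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_pk = is_rejected("INSERT INTO customer VALUES (1, 'b@example.com')")
duplicate_unique = is_rejected("INSERT INTO customer VALUES (4, 'a@example.com')")
duplicate_composite = is_rejected("INSERT INTO order_line VALUES (100, 1, 2)")
missing_parent = is_rejected("INSERT INTO order_line VALUES (101, 1, 99)")
```

Each of the four final statements breaks a different key rule, so the database should reject all of them.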

Data Modelling
 A data model is a conceptual representation of the database tables/objects that serves
as the blueprint of the whole schema.
 There are 3 levels of data model, listed below from the most abstract to
the most concrete:
 Conceptual
 Logical
 Physical

Data Modelling
(Conceptual Model)
 The main aim of this model is to
establish the entities, their attributes,
and their relationships. It contains very
little detail about the actual database
structure.
 The 3 basic tenets of a Data Model are
 Entity: A real-world thing (e.g.,
Customer, Product)
 Attribute: Characteristics or
properties of an entity
 Relationship: Dependency or
association between two entities.

Data Modelling
(Logical Model)
 It defines the structure of the data
elements and sets the relationships
between them.
 The advantage of the Logical data
model is that it provides the
foundation for the Physical
model.
 At this Data Modeling level, no
primary or secondary key is
defined.

Data Modelling
(Physical Model)
 It describes the database specific
implementation of the data model.
 It offers an abstraction of the
database and helps generate
schema.
 This type of Data model also helps
to visualize database structure.
 It helps to model database columns,
keys, constraints, indexes,
triggers, and other RDBMS
features.

Entity Relationship Models/Diagrams
 Entity-relationship diagrams (ER diagrams or ERDs) are drawings that
communicate the relationships between tables.
 An ER diagram is a type of flowchart that illustrates how “entities” such as people, objects
or concepts relate to each other within a system.
 ER Diagrams are most often used to design or debug relational databases.
 They use a defined set of symbols such as
rectangles, diamonds, ovals and connecting lines to depict the
interconnectedness of entities, relationships and their attributes.

Dimensional Modelling
 Dimensional modeling is widely accepted as the preferred technique for
presenting analytic data because it addresses two simultaneous requirements:
 Deliver data that’s understandable to the business users.
 Deliver fast query performance.
 Both 3NF and dimensional models can be represented in ERDs because both
consist of joined relational tables; the key difference between 3NF and
dimensional models is the degree of normalization.
 Normalized 3NF structures divide data into many discrete entities, each of
which becomes a relational table. A database of sales orders might start with
a record for each order line but turn into a complex spider web diagram as a
3NF model, perhaps consisting of hundreds of normalized tables.

Dimensional Modelling (contd.)
 Normalized 3NF structures are immensely useful in operational processing
(OLTP) because an update or insert transaction touches the database in only
one place.
 Normalized models, however, are too complicated for BI queries. Users can’t
understand, navigate, or remember normalized models that resemble a map
of a city.
 The complexity of users’ unpredictable queries overwhelms the database
optimizers, resulting in disastrous query performance.
 Fortunately, dimensional modeling addresses the problem of overly complex
schemas in the presentation area.
 A dimensional model contains the same information as a normalized model,
but packages the data in a format that delivers user understandability, query
performance, and resilience to change.

Dimensional Modelling – Star Schema
 Dimensional models
implemented in relational
database management systems
are referred to as star schemas
because of their resemblance to
a star-like structure.
 The trade-off is load-time cost:
data must be reshaped into this
structure before it can be queried,
which is most noticeable with large
data sets.

Dimensional Modelling – Fact Tables
 The fact table in a dimensional
model stores the performance
measurements resulting from an
organization’s business process
events.
 You should strive to store the low-
level measurement data resulting
from a business process in a single
dimensional model (example of a
grain: "balls per innings").
 Imagine standing in the marketplace
watching products being sold and
writing down the unit quantity and
dollar sales amount for each product
in each sales transaction.

Dimensional Modelling – Fact Tables
(contd.)
 Types of facts: additive, semi-additive, and non-additive.
 Additive - Additive measures can be summed across any of the dimensions
associated with the fact table (e.g., sales amount).
 Semi-additive - These can be summed across some dimensions, but not all; balance
amounts are common semi-additive facts because they are additive across all
dimensions except time.
 Non-additive - Some measures are completely non-additive, such as ratios,
percentages and percentiles.
 Despite their sparsity, fact tables usually make up 90 percent or more of the
total space consumed by a dimensional model.
 Fact tables tend to be deep in terms of the number of rows, but narrow in
terms of the number of columns.
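
To make the difference between additive and semi-additive facts concrete, here is a small sketch using an in-memory SQLite database. The `balance_fact` table, its columns, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balance_fact (date_key TEXT, account_key INTEGER, balance REAL)"
)
conn.executemany(
    "INSERT INTO balance_fact VALUES (?, ?, ?)",
    [
        ("2013-01-01", 1, 100.0),
        ("2013-01-01", 2, 50.0),
        ("2013-01-02", 1, 120.0),
        ("2013-01-02", 2, 50.0),
    ],
)

# Valid: summing balances across the account dimension for each day.
total_per_day = dict(
    conn.execute(
        "SELECT date_key, SUM(balance) FROM balance_fact GROUP BY date_key"
    ).fetchall()
)

# Misleading: summing across time counts the same money once per day.
sum_across_time = conn.execute(
    "SELECT SUM(balance) FROM balance_fact WHERE account_key = 1"
).fetchone()[0]

# One common alternative across time: average the daily balances.
avg_across_time = conn.execute(
    "SELECT AVG(balance) FROM balance_fact WHERE account_key = 1"
).fetchone()[0]
```

The per-day totals are meaningful, whereas the sum for account 1 over two days (220.0) is not an amount that account ever held; the average (110.0) is one way to summarize a balance over time.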

Dimensional Modelling – Dimension Tables
 The dimension tables contain the textual
context associated with a business process
measurement event.
 They describe the “who, what, where, when,
how, and why” associated with the event.
 Dimension tables often have many columns or
attributes.
 Dimension tables tend to have fewer rows
than fact tables, but can be wide with many
large text columns.
 Each dimension is defined by a single primary
key, which serves as the basis for referential
integrity with any given fact table to which it
is joined.

Dimensional Modelling – Dimension Tables (contd.)
 Dimension attributes serve as the primary
source of query constraints, groupings, and
report labels.
 Dimension attributes are critical to making
the DW/BI system usable and understandable.
 The analytic power of the DW/BI environment
is directly proportional to the quality and
depth of the dimension attributes.
 Instead of third normal form, dimension
tables typically are highly denormalized with
flattened many-to-one relationships within a
single dimension table.
 We can almost always trade off dimension
table space for simplicity and accessibility.

Facts and Dimensions joined in a star
schema
 The first thing to notice about the dimensional schema is its simplicity and
symmetry.
 The charm of the design is that it is highly recognizable to business users.
 Furthermore, the reduced number of tables and use of meaningful business
descriptors make it easy to navigate and less likely that mistakes will occur.
 The simplicity of a dimensional model also has performance benefits.
 Database optimizers process these simple schemas with fewer joins more
efficiently.
 Dimension attributes supply the report filters and labeling, whereas the fact
tables supply the report’s numeric values.

Facts and Dimensions
joined in a star
schema (contd.)
SELECT st.district_name,
       pd.brand,
       SUM(sf.sales_dollars) AS "Sales Dollars"
FROM sales_facts sf                                 -- fact table
JOIN store st   ON st.store_key = sf.store_key      -- dimension table
JOIN product pd ON pd.product_key = sf.product_key  -- dimension table
JOIN "date" dt  ON dt.date_key = sf.date_key        -- dimension table; quoted because DATE is a reserved word
WHERE dt.month_name = 'January'                     -- string literals take single quotes
  AND dt.year = 2013
GROUP BY st.district_name, pd.brand;
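
For readers who want to try the query, the following sketch builds a tiny star schema in an in-memory SQLite database and runs an equivalent query. The sample rows are invented, the date dimension is named `date_dim` here (an assumption, chosen to avoid quoting a reserved word), and the column alias is simplified to `sales_dollars`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store (store_key INTEGER PRIMARY KEY, district_name TEXT);
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month_name TEXT, year INTEGER);
    CREATE TABLE sales_facts (
        store_key     INTEGER REFERENCES store (store_key),
        product_key   INTEGER REFERENCES product (product_key),
        date_key      INTEGER REFERENCES date_dim (date_key),
        sales_dollars REAL
    );
    INSERT INTO store VALUES (1, 'North'), (2, 'South');
    INSERT INTO product VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO date_dim VALUES
        (20130115, 'January', 2013),
        (20130210, 'February', 2013);
    INSERT INTO sales_facts VALUES
        (1, 1, 20130115, 10.0),
        (1, 1, 20130115, 5.0),
        (2, 2, 20130115, 7.5),
        (1, 2, 20130210, 99.0);
    """
)

# Dimensions supply the filters and labels; the fact table supplies the numbers.
rows = conn.execute(
    """
    SELECT st.district_name, pd.brand, SUM(sf.sales_dollars) AS sales_dollars
    FROM sales_facts sf
    JOIN store st ON st.store_key = sf.store_key
    JOIN product pd ON pd.product_key = sf.product_key
    JOIN date_dim dt ON dt.date_key = sf.date_key
    WHERE dt.month_name = 'January' AND dt.year = 2013
    GROUP BY st.district_name, pd.brand
    ORDER BY st.district_name, pd.brand
    """
).fetchall()
```

The February row is filtered out by the date dimension, and the two January sales for the North district and the Acme brand are summed into one result row.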

Slowly changing dimensions
 A slowly changing dimension (SCD) is a dimension that stores and manages both current
and historical data over time in a data warehouse. Handling SCDs is considered one of
the most critical ETL tasks in tracking the history of dimension records.
 There are three common types of SCDs; tools such as Oracle Warehouse Builder can
define, deploy, and load all three types.
 Type 1 SCDs - Overwriting
 Type 2 SCDs - Creating another dimension record
 Type 3 SCDs - Creating a current value field

Slowly Changing
Dimensions (contd.)
 Type 1 -Overwriting the old
value. In this method no history
of dimension changes is kept in
the database. The old dimension
value is simply overwritten by
the new one.
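
A Type 1 change can be sketched as a plain in-place update. The `customer_dim` table, its columns, and the `apply_scd_type1` helper are hypothetical names used only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_key INTEGER PRIMARY KEY, customer_id TEXT, city TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1, 'C001', 'Pune')")


def apply_scd_type1(conn, customer_id, new_city):
    """Overwrite the attribute in place; the old value is not kept."""
    conn.execute(
        "UPDATE customer_dim SET city = ? WHERE customer_id = ?",
        (new_city, customer_id),
    )


apply_scd_type1(conn, "C001", "Delhi")
rows = conn.execute(
    "SELECT customer_key, customer_id, city FROM customer_dim"
).fetchall()
```

The dimension still has one row for the customer, and nothing in the table records that the city was ever different.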

Slowly Changing
Dimensions (contd.)
 Type 2 - Creating a new
additional record. In this
methodology all history of
dimension changes is kept in the
database. Changes in the
attributes are captured by
adding a new row with a new
surrogate key to the dimension
table.
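
One possible way to implement a Type 2 change is sketched below. The `valid_from`, `valid_to`, and `is_current` housekeeping columns are a common convention but are assumptions here, as are the table name and the `apply_scd_type2` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  TEXT NOT NULL,                      -- natural (business) key
        city         TEXT NOT NULL,
        valid_from   TEXT NOT NULL,
        valid_to     TEXT,                               -- NULL while the row is current
        is_current   INTEGER NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
    "VALUES ('C001', 'Pune', '2013-01-01', NULL, 1)"
)


def apply_scd_type2(conn, customer_id, new_city, change_date):
    """Expire the current row and add a new row with a new surrogate key."""
    current = conn.execute(
        "SELECT customer_key, city FROM customer_dim "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == new_city:
        return  # attribute unchanged, nothing to record
    if current is not None:
        conn.execute(
            "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
            "WHERE customer_key = ?",
            (change_date, current[0]),
        )
    conn.execute(
        "INSERT INTO customer_dim (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )


apply_scd_type2(conn, "C001", "Delhi", "2014-06-01")
history = conn.execute(
    "SELECT customer_key, city, valid_from, valid_to, is_current "
    "FROM customer_dim ORDER BY customer_key"
).fetchall()
```

Fact rows loaded before the change keep pointing at the old surrogate key, so historical reports still show the city that applied at the time.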

Slowly Changing
Dimensions (contd.)
 Type 3 - Adding a new column. In
this type usually only the current
and previous value of dimension
is kept in the database. The new
value is loaded into 'current/new'
column and the old one into
'old/previous' column. Generally
speaking the history is limited to
the number of columns created
for storing historical data. This is
the least commonly needed
technique.
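
A Type 3 change can be sketched with one extra column. The `current_city` and `previous_city` column names and the `apply_scd_type3` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_key INTEGER PRIMARY KEY, customer_id TEXT, "
    "current_city TEXT, previous_city TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1, 'C001', 'Pune', NULL)")


def apply_scd_type3(conn, customer_id, new_city):
    """Move the current value to the 'previous' column, then store the new value."""
    # Within a single UPDATE, the right-hand sides read the row's old values.
    conn.execute(
        "UPDATE customer_dim "
        "SET previous_city = current_city, current_city = ? "
        "WHERE customer_id = ?",
        (new_city, customer_id),
    )


apply_scd_type3(conn, "C001", "Delhi")
after_first = conn.execute(
    "SELECT current_city, previous_city FROM customer_dim"
).fetchone()

apply_scd_type3(conn, "C001", "Mumbai")
after_second = conn.execute(
    "SELECT current_city, previous_city FROM customer_dim"
).fetchone()
```

After the second change the original city is gone, which illustrates why the history is limited to the number of columns set aside for it.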

Data Marts
 A data mart is focused on a single functional
area of an organization and contains a subset
of data stored in a Data Warehouse.
 A data mart is a condensed version of Data
Warehouse and is designed for use by a
specific department, unit or set of users in an
organization. E.g., Marketing, Sales, HR or
finance.
 It is often controlled by a single department
in an organization.
 Data Mart usually draws data from only a few
sources compared to a Data warehouse.
 Data marts are small in size and are more
flexible compared to a Data warehouse.

A glimpse of Data Warehouse
Architecture

Data Warehouse Architecture (Inmon)

Data Warehouse Architecture (Kimball)

Data Warehouse Architectures (In Detail)
 Topics:
 Independent Data Mart Architecture
 Kimball Architecture
 Inmon Architecture (Hub-and-Spoke Corporate Information Factory)
 Key differences between Inmon and Kimball architectures

Independent Data
Mart Architecture
 Data is deployed on a departmental
basis without regard to sharing and
integrating information across the
enterprise.
 It is not recommended, but this approach
is prevalent, especially in large
organizations.
 It’s the path of least resistance for fast
development at relatively low cost, at
least in the short run.
 Multiple uncoordinated extracts from
the same operational sources and
redundant storage of analytic data are
inefficient and wasteful in the long run.

Kimball Architecture
 There are four separate and distinct components to consider in the DW/BI
environment:
 Operational source systems (OLTP)
 ETL system
 Data presentation area (Dimensional Model)
 Business intelligence applications.

Kimball Architecture (contd.)
 Operational Source Systems:
 These are the operational systems of record that capture the business’s
transactions (OLTP)
 Extract, Transformation, and Load System:
 Extraction is the first step in the process of getting data into the data warehouse
environment
 There are numerous potential transformations, such as :
 Cleansing the data, e.g., converting to standard date formats or correcting misspellings.
 Combining data from multiple sources and de-duplicating data.
 The primary mission of the ETL system is to hand off the dimension and fact tables in
the delivery step, which makes these subsystems critical.
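
The transformations mentioned above (standardizing dates, combining sources, de-duplicating) might look like the following sketch. The record layout, the accepted date formats, and the function names are all assumptions for illustration; a real ETL system would typically use a dedicated tool and far richer rules.

```python
from datetime import datetime

# Assumed input formats; a real source system may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_iso_date(raw):
    """Convert a date string in any of DATE_FORMATS to ISO format (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def transform(records):
    """Cleanse records from several sources and drop duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        row = (
            rec["customer_id"].strip().upper(),
            rec["name"].strip().title(),
            to_iso_date(rec["signup_date"]),
        )
        if row[0] in seen:  # de-duplicate on the business key, keeping the first
            continue
        seen.add(row[0])
        cleaned.append(row)
    return cleaned


source_a = [{"customer_id": "c001", "name": "asha rao ", "signup_date": "2013-01-15"}]
source_b = [
    {"customer_id": "C001", "name": "Asha Rao", "signup_date": "15/01/2013"},
    {"customer_id": "C002", "name": "ravi kumar", "signup_date": "03.02.2013"},
]
result = transform(source_a + source_b)
```

Keeping the first record per business key is only one possible survivorship rule; which source wins in a conflict is a business decision.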

Kimball Architecture (contd.)
 Data presentation area:
 The dimensional model, organized as star schemas, which is the end product of the whole
ETL design.
 Business Intelligence Applications:
 By definition, all BI applications query the data in the DW/BI presentation area and
present it in a more meaningful and descriptive form using reporting
techniques such as pie charts, bar graphs, and cross tabs.

Hub and spoke
Corporate Information
Factory (Inmon)
 This model identifies the key subject areas,
and most importantly, the key entities the
business operates with and cares about, like
customer, product, vendor, etc.
 A detailed logical model is created for each
major entity. For example, a logical model will
be built for Customer with all the details
related to that entity.
 The key point here is that the entity structure
is built in normalized form (3NF), so data
redundancy is avoided as much as possible.
 Data marts specific for departments are built
on top of the 3NF model and the data marts
can have de-normalized data to help with
reporting/ fast querying.

Kimball vs Inmon Architectures
Characteristics | Favours Kimball | Favours Inmon
Business decision support requirements | Tactical | Strategic
Data integration requirements | Individual business requirements | Enterprise-wide integration
The structure of data | KPI, business performance measures, scorecards… | Data that meet multiple and varied information needs, and non-metric data
Persistence of data in source systems | Source systems are quite stable | Source systems have a high rate of change
Skill sets | Small team of generalists | Bigger team of specialists
Time constraint | Urgent needs for the first data warehouse | Longer time is allowed to meet business needs
Cost to build | Low start-up cost | High start-up costs

