Designing a schema
for a Data Warehouse
Why a Data
Warehouse?
DWH
A company's data is scattered across:
● Different databases
● Internal applications
● SaaS applications
The latter may be accessible via APIs or downloadable files
Why a DWH
This means:
● Different wire protocols, query languages
● Different schemas, methodically UNdocumented
● Designed to retrieve a single row, not aggregations
● On a technology designed for OLTP
● Conflicting / redundant / incorrect / missing data
● Business metrics are mixed with PII
Instead, you want data analysts to…
● Connect to a single SQL database
● With a well-known, standard schema
● Designed for analytical queries
● On a technology designed to run analytical queries
This standard schema is designed for analytics queries:
● JOIN
● WHERE
● GROUP BY
It's called a Star Schema. Its most basic concepts are:
● A star represents an event: customer buys product
● A dimension is any event characteristic we might use for
filtering and ordering: purchase date, delivery date, product
name, product category, customer city…
● The grain defines how specific dimensions are: date or
month? city or postcode?
● Facts are the measurements we take: cost, discount,
number of products bought, etc.
DWH design
Designing a DWH is a technological activity
that requires some business knowledge
How to design
FALSE!!!
Designing a DWH is a business activity
that requires technical skills
It starts by identifying business processes you want to have
more information about
Example processes:
● A customer buys a product
● A Google Ads campaign runs
● A courier delivers a pizza
● While doing so, write a dictionary of business terms
● Everyone should understand the terms
● In many companies different teams use some terms with
different meanings
Discuss each event with all the people who need information
about it
Typically people from multiple departments
Find out:
● Facts - numerical measurements to take (cost, discount,
number of products bought, etc.)
● Dimensions - Event characteristics that can be used for
filtering and grouping: purchase date, delivery date, store,
product name, product category, customer city…
● The grain defines how specific dimensions are: date or
month? city or postcode?
Modify the event statement by adding the time and the
dimensions that affect its grain
● A customer buys a product
● Customers buy products in a day, in a city
● Customers buy products in a month, in a country
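As a rough illustration of grain, the sketch below stores purchases at the "day, city" grain and rolls them up to the coarser "month, country" grain. The table name `fact_purchase`, its columns, and the sample rows are assumptions made for this example. The roll-up only works in one direction: data stored at month/country grain could not be broken back down to days and cities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical fact table stored at the "day, city" grain
    CREATE TABLE fact_purchase (
        purchase_date TEXT,   -- ISO date, e.g. 2024-01-15
        country       TEXT,
        city          TEXT,
        total_price   REAL
    );
    INSERT INTO fact_purchase VALUES
        ('2024-01-15', 'FR', 'Paris',   150.00),
        ('2024-01-15', 'FR', 'Avignon',  22.50),
        ('2024-01-16', 'FR', 'Paris',    10.00),
        ('2024-02-01', 'FR', 'Nice',    199.50);
""")

# Roll up to the coarser "month, country" grain.
monthly = conn.execute("""
    SELECT substr(purchase_date, 1, 7) AS month,
           country,
           COUNT(*)         AS purchases,
           SUM(total_price) AS revenue
    FROM fact_purchase
    GROUP BY month, country
    ORDER BY month, country
""").fetchall()

for row in monthly:
    print(row)
```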
Dimensions
Dimensions are the criteria that will be used to
● Aggregate
● Filter
● Order
the numbers.
Example:
● Average amount spent
● By customers over 40, in 2024, in France
● Aggregated by store_city, date
Table: fact_in_store_purchase
date        country  city     customer_dob  prod_count  total_price
2024/01/15  FR       Paris    1950/02/02    2           150.00
2024/01/15  FR       Paris    1952/02/02    1           999.99
2024/01/15  FR       Avignon  1962/10/04    1            22.50
2024/01/16  FR       Paris    1978/12/02    2            10.00
2024/01/16  FR       Nice     1977/11/09    1           199.50
With this simplistic design:
● Adding dimensional columns is a pain
● Loading data into the table is harder
● We can't query a dimension alone
● We can't get a list of things that didn't happen
Dimensions usually look like this:
● Stored in separate tables
● Denormalised: hierarchies are represented by repeating data
● They have an ID that is unique to the DWH and has no
meaning
● Human-readable information is stored in other columns,
which are indexed
Table: dim_city
continent  country   city       local_name  language  population
Europe     Italy     Rome       Roma        it        10000
Europe     Italy     Milan      Milano      it        20000
Europe     Italy     Alghero    Alghero     it        30000
Europe     Italy     Alghero    Alghero     ca        30000
Europe     Scotland  Edinburgh  Edinburgh   en        12345
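A minimal sketch of why a separate dimension table helps, using SQLite from Python. The surrogate key name `city_id`, the index name, and the city "Lyon" are assumptions added for illustration. With the dimension in its own table we can query it alone, and we can also list things that did not happen (here, cities with no purchases).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- city_id is a meaningless surrogate key (name assumed for this sketch)
    CREATE TABLE dim_city (
        city_id INTEGER PRIMARY KEY,
        country TEXT,
        city    TEXT
    );
    -- Human-readable columns are indexed because analysts filter on them
    CREATE INDEX idx_dim_city_name ON dim_city (country, city);

    CREATE TABLE fact_in_store_purchase (
        city_id     INTEGER REFERENCES dim_city (city_id),
        prod_count  INTEGER,
        total_price REAL
    );

    INSERT INTO dim_city VALUES
        (32, 'France', 'Paris'),
        (44, 'France', 'Avignon'),
        (71, 'France', 'Nice'),
        (90, 'France', 'Lyon');   -- hypothetical city with no purchases
    INSERT INTO fact_in_store_purchase VALUES
        (32, 2, 150.00), (32, 1, 999.99), (44, 1, 22.50), (71, 1, 199.50);
""")

# Query the dimension alone: which French cities do we know about?
known_cities = [r[0] for r in conn.execute(
    "SELECT city FROM dim_city WHERE country = 'France' ORDER BY city")]

# Things that didn't happen: cities without any purchase row.
no_sales = [r[0] for r in conn.execute("""
    SELECT c.city
    FROM dim_city c
    LEFT JOIN fact_in_store_purchase f ON f.city_id = c.city_id
    WHERE f.city_id IS NULL
""")]

print(known_cities)
print(no_sales)
```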
Facts
A fact table usually contains:
● References / foreign keys to Dimension tables
● One or more numeric columns (facts)
Table: fact_in_store_purchase
date      country  city  customer_dob  prod_count  total_price
20240115  15       32    19500202      2           150.00
20240115  15       32    19520202      1           999.99
20240115  15       44    19621004      1            22.50
20240116  15       32    19781202      2            10.00
20240116  15       71    19771109      1           199.50
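To join the fact table to its dimensions:
-- Grouped columns plus aggregates (f.* is not valid with GROUP BY)
SELECT dt.date, ct.city,
       SUM(f.prod_count) AS prod_count,
       SUM(f.total_price) AS total_price
FROM fact_in_store_purchase f
-- NATURAL JOIN matches on identically named columns,
-- so key columns must share names across tables
NATURAL JOIN dim_date dt
NATURAL JOIN dim_city ct  -- dim_city holds both city and country
NATURAL JOIN dim_customer cu
WHERE dt.date > 20240000
AND dt.week_day BETWEEN 1 AND 5
AND ct.country = 'France'
AND cu.dob BETWEEN 19800000 AND 20000000
GROUP BY dt.date, ct.city
ORDER BY dt.date, ct.city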
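The same kind of query can be tried end to end with SQLite. This is only a sketch: the key columns `date_id`, `city_id` and `customer_id`, the explicit `JOIN ... ON` form, the `week_day` numbering and the sample rows are all assumptions chosen for the example, not the schema used in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, week_day INTEGER);
    CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, country TEXT, city TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, dob INTEGER);
    CREATE TABLE fact_in_store_purchase (
        date_id INTEGER, city_id INTEGER, customer_id INTEGER,
        prod_count INTEGER, total_price REAL
    );
    -- week_day: 1 = Monday ... 7 = Sunday (convention assumed here)
    INSERT INTO dim_date VALUES (20240115, 1), (20240116, 2), (20240120, 6);
    INSERT INTO dim_city VALUES (32, 'France', 'Paris'), (71, 'France', 'Nice');
    INSERT INTO dim_customer VALUES (1, 19850310), (2, 19920721), (3, 19500202);
    INSERT INTO fact_in_store_purchase VALUES
        (20240115, 32, 1, 2, 150.00),
        (20240115, 32, 2, 1,  50.00),
        (20240116, 71, 1, 1, 199.50),
        (20240120, 32, 2, 3,  30.00),   -- Saturday: filtered out
        (20240116, 32, 3, 1,  10.00);   -- born in 1950: filtered out
""")

rows = conn.execute("""
    SELECT dt.date_id, ct.city,
           SUM(f.prod_count)  AS products,
           SUM(f.total_price) AS revenue
    FROM fact_in_store_purchase f
    JOIN dim_date dt     ON dt.date_id = f.date_id
    JOIN dim_city ct     ON ct.city_id = f.city_id
    JOIN dim_customer cu ON cu.customer_id = f.customer_id
    WHERE dt.date_id > 20240000
      AND dt.week_day BETWEEN 1 AND 5
      AND ct.country = 'France'
      AND cu.dob BETWEEN 19800000 AND 20000000
    GROUP BY dt.date_id, ct.city
    ORDER BY dt.date_id, ct.city
""").fetchall()

for row in rows:
    print(row)
```

Explicit join conditions are used here so that the example does not depend on column names matching across tables.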
There are 3 types of fact tables:
● Transaction fact tables
○ The company buys products
● Periodic snapshot fact tables
○ Monthly inventory
● Accumulating snapshot fact tables
○ Multi-step: courier delivers pizza
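To make the accumulating snapshot idea more concrete, here is a small sketch of the pizza delivery example. The table `fact_pizza_delivery` and its milestone columns are hypothetical. Each delivery gets one row, and the row is updated as each step completes, whereas a transaction fact row is normally inserted once and left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per delivery; milestone columns are filled in as steps complete
    CREATE TABLE fact_pizza_delivery (
        order_id       INTEGER PRIMARY KEY,
        ordered_time   INTEGER,   -- hhmm keys; NULL until the step happens
        picked_up_time INTEGER,
        delivered_time INTEGER
    );
    INSERT INTO fact_pizza_delivery (order_id, ordered_time)
    VALUES (1, 1905), (2, 1910);
""")

# The row for each order is updated in place as milestones are reached.
conn.execute("UPDATE fact_pizza_delivery SET picked_up_time = 1925 WHERE order_id = 1")
conn.execute("UPDATE fact_pizza_delivery SET delivered_time = 1940 WHERE order_id = 1")
conn.execute("UPDATE fact_pizza_delivery SET picked_up_time = 1935 WHERE order_id = 2")

# Orders still in progress have at least one empty milestone.
in_progress = conn.execute(
    "SELECT order_id FROM fact_pizza_delivery WHERE delivered_time IS NULL"
).fetchall()
print(in_progress)
```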
Factless fact tables are a special type of fact table.
They don't have any fact column.
They are boolean facts: an existing row is TRUE, a non-existing
row is FALSE.
Table: fact_customer_care_call
date      customer_id  operator_id
20240201  87612        927
20240201  999111       2250
20240201  8825         822
20240202  19166        1002
20240202  38410        948
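A factless fact table is typically queried by counting rows or testing whether a row exists. The sketch below loads the sample rows above into SQLite; the column types are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_customer_care_call (
        date INTEGER, customer_id INTEGER, operator_id INTEGER
    );
    INSERT INTO fact_customer_care_call VALUES
        (20240201, 87612, 927), (20240201, 999111, 2250), (20240201, 8825, 822),
        (20240202, 19166, 1002), (20240202, 38410, 948);
""")

# There is no measure to sum: the rows themselves are what we count.
calls_per_day = conn.execute("""
    SELECT date, COUNT(*)
    FROM fact_customer_care_call
    GROUP BY date
    ORDER BY date
""").fetchall()

# "Did customer 8825 call on 2024-02-01?" is an existence test.
called = bool(conn.execute("""
    SELECT EXISTS (SELECT 1 FROM fact_customer_care_call
                   WHERE date = 20240201 AND customer_id = 8825)
""").fetchone()[0])

print(calls_per_day, called)
```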
Time Dimensions
General rules for time dimensions:
● One dimension for date only, without time
● Primary key: an integer id in the form yyyymmdd
● Add columns for any significant information: year, month,
month day, week day, workday, leap year…
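A date dimension is usually generated rather than loaded from a source system. The following is one possible sketch for the year 2024. The column names, the ISO weekday numbering and the "Monday to Friday" definition of a workday are assumptions; a real calendar would also need public holidays.

```python
import calendar
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_date (
        date_id      INTEGER PRIMARY KEY,  -- yyyymmdd
        year         INTEGER,
        month        INTEGER,
        month_day    INTEGER,
        week_day     INTEGER,              -- 1 = Monday ... 7 = Sunday (ISO)
        is_workday   INTEGER,              -- simplification: Monday to Friday
        is_leap_year INTEGER
    )
""")

def date_rows(start: date, end: date):
    """Yield one dim_date row per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield (
            day.year * 10000 + day.month * 100 + day.day,
            day.year, day.month, day.day,
            day.isoweekday(),
            int(day.isoweekday() <= 5),
            int(calendar.isleap(day.year)),
        )
        day += timedelta(days=1)

conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?, ?, ?, ?, ?)",
    date_rows(date(2024, 1, 1), date(2024, 12, 31)),
)

day_count = conn.execute("SELECT COUNT(*) FROM dim_date").fetchone()[0]
sample = conn.execute("SELECT * FROM dim_date WHERE date_id = 20240115").fetchone()
print(day_count, sample)
```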
Store the time of day in a separate dimension, if needed
● Primary key: an integer id in the format hhmm
● Add separate columns for hours, minutes and any other
information you might need
● Depending on your needs, add a row for every minute, every
half hour, or every hour within working hours, etc.
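A time-of-day dimension can be generated in the same spirit. This sketch assumes working hours of 09:00 to 18:00 and one row per half hour; the table name `dim_time` and the `day_part` column are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_time (
        time_id  INTEGER PRIMARY KEY,  -- hhmm, e.g. 930 for 09:30
        hour     INTEGER,
        minute   INTEGER,
        day_part TEXT                  -- example of extra information
    )
""")

def time_rows(first_hour: int, last_hour: int, step_minutes: int):
    """Yield one row per step from first_hour (inclusive) to last_hour (exclusive)."""
    for total in range(first_hour * 60, last_hour * 60, step_minutes):
        hour, minute = divmod(total, 60)
        yield (hour * 100 + minute, hour, minute,
               "morning" if hour < 12 else "afternoon")

conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?)", time_rows(9, 18, 30))

time_ids = [r[0] for r in conn.execute("SELECT time_id FROM dim_time ORDER BY time_id")]
print(time_ids)
```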
Constellation
schemas
● You typically have multiple star schemas linked together
(Constellation Schema)
● Most dimensions should be shared across multiple stars
(Conformed Dimensions)
● Two stars might represent the same data with different
granularity, so some facts are present in multiple tables
● Make sure that facts are named consistently across stars
(Conformed Facts)
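A conformed dimension lets us compare numbers from different stars. In the sketch below, two fact tables share `dim_city`; the second table, `fact_store_visit`, and all key names are hypothetical. Each star is aggregated separately and the results are then joined on the shared dimension, because joining the two fact tables directly would multiply rows and inflate the totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One conformed dimension shared by two stars
    CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_in_store_purchase (city_id INTEGER, total_price REAL);
    CREATE TABLE fact_store_visit (city_id INTEGER, visitor_count INTEGER);

    INSERT INTO dim_city VALUES (32, 'Paris'), (71, 'Nice');
    INSERT INTO fact_in_store_purchase VALUES (32, 150.00), (32, 10.00), (71, 199.50);
    INSERT INTO fact_store_visit VALUES (32, 40), (71, 15), (71, 5);
""")

rows = conn.execute("""
    WITH sales AS (
        SELECT city_id, SUM(total_price) AS revenue
        FROM fact_in_store_purchase
        GROUP BY city_id
    ),
    visits AS (
        SELECT city_id, SUM(visitor_count) AS visitors
        FROM fact_store_visit
        GROUP BY city_id
    )
    SELECT c.city, s.revenue, v.visitors
    FROM dim_city c
    LEFT JOIN sales s  ON s.city_id = c.city_id
    LEFT JOIN visits v ON v.city_id = c.city_id
    ORDER BY c.city
""").fetchall()

for row in rows:
    print(row)
```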
But DWH is a
complex matter…
We left out many topics, for example…
● How to represent invoice or bill of lading dimensions
(1 invoice contains multiple items)
● How to represent dimensions that change over time
● Role playing dimensions and other dimension types
● Data marts, data lakes
● DWH to feed Machine Learning
● …and more
Interested? Contact us for a training!
What we left out
