Introduction

eacweber@gmail.com
Data Vault Definition

The Data Vault is a detail oriented, historical tracking and uniquely
linked set of normalized tables that support one or more functional
areas of business. It is a hybrid approach encompassing the best
of breed between 3rd normal form (3NF) and star schema. The
design is flexible, scalable, consistent and adaptable to the needs
of the enterprise. It is a data model that is architected specifically
to meet the needs of enterprise data warehouses.



Source: Dan Linstedt
http://www.tdan.com/view-articles/5054/
Data Vault Building Blocks
[Diagram annotation: different sources/rate of change]




Source: Dan Linstedt
http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012
Data Vault Fundamentals: Hub




Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
Data Vault Fundamentals: Link




Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
Data Vault Fundamentals: Satellite




Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
Data Vault Fundamentals: Model




Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
Data Vault ETL
Many objects to load, all with standardized procedures
This screams for a generic solution!
I don't want to:
  throw the ETL tool away and code it all myself
  manage too many ETL objects
  connect similar columns in mappings by hand
I do want to:
  generate ETL (Kettle) objects? No
  take it one step further: there is only one parameterised hub load
  object, so there is no need to know the XML structure of PDI objects
Tools
Operating System
Database
Virtualization
Data Integration
'Productivity'
SQL Development
Version Control
Place of framework in architecture


[Architecture diagram, left to right]
Sources: Files, DBMS, ERP
ETL
Staging Area: MySQL, CSV files
ETL: Kettle Data Vault Framework
Central DWH & Data Marts: MySQL Data Vault
ETL
EUL
[Column labels: Sources, ETL Process, Data Warehouse, EUL]
What has to be taken care of?

Data Vault designed and implemented in database
Staging tables and loading procedures in place
(can also be generic; we use the PDI Metadata Injection step for loading files)

Mapping from source to Data Vault specified
(now in an Excel sheet)
Framework components

PDI repository (file based), jobs and transformations
Configuration files:
kettle.properties
shared.xml
repositories.xml

Excel sheet that contains the specifications
MySQL database for metadata
Virtual machine with Ubuntu 12.04 Server
Design decisions

Updateable views with generic column names
(MySQL more lenient than PostgreSQL)
Compare satellite attributes via string comparison
(concatenate all columns, with | (pipe) as delimiter)

'Inject' the metadata using Kettle parameters
Generate and use an error table for each Data Vault table
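
The string comparison of satellite attributes can be sketched roughly as follows. This is a Python illustration of the idea only; the framework itself does this inside Kettle. The function names and the rendering of NULL as an empty string are assumptions, not taken from the slides.

```python
from typing import Mapping, Optional, Sequence

DELIMITER = "|"  # pipe delimiter, as named on the slide


def concat_attributes(row: Mapping[str, Optional[object]],
                      columns: Sequence[str]) -> str:
    """Concatenate the attribute values of one row into one comparison string."""
    # Assumption: NULL (None) is rendered as an empty string.
    return DELIMITER.join("" if row[c] is None else str(row[c]) for c in columns)


def has_changed(source_row: Mapping[str, Optional[object]],
                current_sat_row: Optional[Mapping[str, Optional[object]]],
                columns: Sequence[str]) -> bool:
    """Return True when a new satellite row would have to be written."""
    if current_sat_row is None:  # no satellite row exists yet for this key
        return True
    return (concat_attributes(source_row, columns)
            != concat_attributes(current_sat_row, columns))
```

One caveat of this approach: a value that itself contains the delimiter can hide a change, because ("a|b", "c") and ("a", "b|c") both concatenate to "a|b|c".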
Metadata tables




All have history tables
Metadata in Excel
Data Vault
connections
source systems
source tables
Metadata in Excel (hub + sat)




[Sheet annotation: x 200 (max)]
Metadata in Excel (link)




[Sheet annotations: x 10; link attributes]
Metadata in Excel (link satellite)



[Sheet annotations: x 10; x 5; x 200 (max)]
Last seen date

applicable for hubs and links
existing hubs and links: update 'last_seen_dts'!
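
A minimal sketch of the last seen logic for a hub, using an in-memory SQLite database so that it runs anywhere. Only 'last_seen_dts' comes from the slide; the table name hub_customer, the other column names, and the helper functions are hypothetical.

```python
import sqlite3


def create_hub_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical hub table with a last seen timestamp."""
    conn.execute(
        """CREATE TABLE hub_customer (
               customer_id   INTEGER PRIMARY KEY AUTOINCREMENT,
               customer_bk   TEXT NOT NULL UNIQUE,
               load_dts      TEXT NOT NULL,
               last_seen_dts TEXT NOT NULL)"""
    )


def load_hub(conn: sqlite3.Connection, business_keys, load_dts: str) -> None:
    """Insert unseen business keys; refresh last_seen_dts for existing ones."""
    for bk in business_keys:
        cur = conn.execute(
            "UPDATE hub_customer SET last_seen_dts = ? WHERE customer_bk = ?",
            (load_dts, bk),
        )
        if cur.rowcount == 0:  # key not present yet, so insert it
            conn.execute(
                "INSERT INTO hub_customer (customer_bk, load_dts, last_seen_dts) "
                "VALUES (?, ?, ?)",
                (bk, load_dts, load_dts),
            )
    conn.commit()
```

After loading keys C1 and C2 on one day and C2 and C3 on the next, C1 keeps its original last_seen_dts while C2 carries the later date, which makes keys that disappeared from the source easy to spot.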
Link validity satellite
Link has a 'business key' made up of some, but not all, of its hub IDs
Loading the metadata
'design errors'

Checks to avoid debugging:
(compares design metadata with Data Vault DB information_schema)


  hubs, links, satellites that don't exist in the DV
  key columns that do not exist in the DV
  missing connection data (source db)
  missing attribute columns
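
The checks above could look roughly like this in Python. The sketch assumes the design metadata is available as a list of dictionaries and that (table, column) pairs have already been read from information_schema.columns; both structures and the function name are hypothetical. The check for missing connection data is left out.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_design_errors(design: List[dict],
                       schema_columns: Iterable[Tuple[str, str]]) -> List[str]:
    """Compare design metadata with (table, column) pairs from the Data Vault DB."""
    existing: Dict[str, Set[str]] = {}
    for table, column in schema_columns:
        existing.setdefault(table, set()).add(column)

    errors: List[str] = []
    for obj in design:
        table = obj["table"]
        if table not in existing:
            errors.append(f"{table}: table does not exist in the DV")
            continue  # column checks make no sense without the table
        if obj["key_column"] not in existing[table]:
            errors.append(f"{table}: key column {obj['key_column']} does not exist")
        for col in obj.get("attribute_columns", []):
            if col not in existing[table]:
                errors.append(f"{table}: missing attribute column {col}")
    return errors
```

Running such a check right after loading the metadata reports design mistakes as a readable list instead of as failures halfway through a load.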
A complete run
Metadata needed for a hub

name
key column
business key column
source table
source table business key column
(can be expression, e.g. concatenate for composite key)
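
As an illustration, the hub metadata listed above fits in a small record, from which the source query for a parameterised hub load can be derived. This is a Python sketch, not the framework's Kettle implementation; the class, field, and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubMetadata:
    name: str                  # hub table name
    key_column: str            # surrogate key column of the hub
    business_key_column: str   # business key column of the hub
    source_table: str
    source_business_key: str   # column name or SQL expression in the source


def hub_source_query(hub: HubMetadata) -> str:
    """Build the query that selects the distinct business keys from the source."""
    # Identifiers are interpolated, not bound, so the metadata must be trusted.
    return (
        f"SELECT DISTINCT {hub.source_business_key} AS {hub.business_key_column} "
        f"FROM {hub.source_table}"
    )
```

For a composite key, source_business_key could hold an expression such as CONCAT(order_no, '-', line_no); the query builder does not need to know the difference.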
Job for hub
Transformation for hub
Metadata needed for a link
name
key column
for each hub (maximum 10, can be a ref-table)
   hub name
   column name for the hub key in the link (roles!)
   column in the source table → business key of hub


link 'attributes' (part of key, no hub, maximum 5)
link validity satellite needed?
last seen date needed?
source table
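
The link metadata can be modelled in the same way. The sketch below enforces the limits named above (10 hubs, 5 link attributes) and shows why the column name per hub matters: the same hub can take part in a link more than once in different roles. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HUBS = 10             # maximum from the slide
MAX_LINK_ATTRIBUTES = 5   # maximum from the slide


@dataclass(frozen=True)
class LinkHubRef:
    hub_name: str
    link_column: str     # hub key column in the link; differs per role
    source_column: str   # source column holding the hub's business key


@dataclass
class LinkMetadata:
    name: str
    key_column: str
    source_table: str
    hubs: List[LinkHubRef]
    attributes: List[str] = field(default_factory=list)  # part of key, no hub
    validity_satellite: bool = False
    last_seen_date: bool = False

    def __post_init__(self) -> None:
        if len(self.hubs) > MAX_HUBS:
            raise ValueError(f"{self.name}: more than {MAX_HUBS} hubs")
        if len(self.attributes) > MAX_LINK_ATTRIBUTES:
            raise ValueError(f"{self.name}: more than {MAX_LINK_ATTRIBUTES} attributes")
        columns = [h.link_column for h in self.hubs]
        if len(set(columns)) != len(columns):
            raise ValueError(f"{self.name}: each role needs its own link column")
```

A 'reports to' link, for example, would reference a hypothetical hub_employee twice, once through employee_id and once through manager_id.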
Job for link
Transformation for link




[Transformation diagram annotations:]
Last seen?
Lookup hubs
Remove columns not in link
Run table needed for validity sat?
Metadata needed for a hub satellite

  name
  key column
  hub name
  column in the source table → business key of hub
  for each attribute (maximum 200)
    source column
    target column
  source table
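
One plausible way to let a single generic load handle any satellite is to expose each satellite through a view with generic column names, in line with the design decision on updatable views. The sketch below generates such a view definition; the naming scheme (v_ prefix, dv_key, attr_001 and so on) and the function name are assumptions, not taken from the framework.

```python
from typing import Sequence

MAX_ATTRIBUTES = 200  # maximum from the slide


def generic_view_ddl(sat_name: str, key_column: str,
                     target_columns: Sequence[str]) -> str:
    """Build a CREATE VIEW statement that renames columns to generic names."""
    if len(target_columns) > MAX_ATTRIBUTES:
        raise ValueError(f"{sat_name}: more than {MAX_ATTRIBUTES} attributes")
    select_list = [f"{key_column} AS dv_key"]
    select_list += [
        f"{col} AS attr_{i:03d}" for i, col in enumerate(target_columns, start=1)
    ]
    return (
        f"CREATE VIEW v_{sat_name} AS SELECT "
        + ", ".join(select_list)
        + f" FROM {sat_name}"
    )
```

With views like this, the load only ever writes to dv_key and attr_001 onwards, regardless of the real column names in the satellite.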
Job for hub satellite
Transformation for hub satellite
Metadata needed for a link satellite

name
key column
link name
for each hub of the link:
column in the source table → business key of hub
for each key attribute: source column
for each attribute: source column → target column
source table
Job for link satellite
Transformation for link satellite
Executing in a loop ..
.. and parallel
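
In PDI the loop and the parallel execution are built with jobs. A rough Python analogue of the idea is sketched below, assuming hubs have to be loaded before the links and satellites that look up their keys. The function name, the stage structure, and the load_one callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def run_all(stages: Sequence[List[str]],
            load_one: Callable[[str], object],
            max_workers: int = 4) -> Dict[str, object]:
    """Run the stages one after another; objects within a stage run in parallel."""
    results: Dict[str, object] = {}
    for stage in stages:
        # Leaving the with-block waits for every load of this stage to finish.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for name, outcome in zip(stage, pool.map(load_one, stage)):
                results[name] = outcome
    return results
```

A call such as run_all([hub_names, link_names, satellite_names], load_one) loops over every object from the metadata while still respecting the order between stages.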
Logging


Custom logging
PDI logging
Configuring log tables for concurrent access
Version Control: PDI objects
Version Control: database objects
Some points of interest

Easy to make mistakes in the design sheet
Generic → a bit harder to maintain and debug
Application/tool to maintain metadata?
Data Vault generators (e.g. Quipu)?
Spin-off using Informatica and Oracle: Sander Robijns
Thanks to: Jos van Dongen, Kasper de Graaf
SourceForge!

Presentation pdi data_vault_framework_meetup2012
