Introduction

                    Edwin Weber
                   Weber Solutions
                eacweber@gmail.com


Back end of Data Warehousing
MySQL, SQL Server, Oracle, PostgreSQL
PDI, SSIS, Oracle Warehouse Builder (long ago)
Project

Sint Antonius hospital in Utrecht, Open Source oriented
Chance to combine Kettle experience with a Data Vault
  (new to me)
At practically the same time: a project with SSIS and a Data Vault
So I jumped on the Data Vault bandwagon
Data Vault ETL


    Many objects to load, standardized procedures

    This screams for a generic solution

    I don't want to:

        manage too many Kettle objects

        connect similar columns in mappings by hand

    Solution:

        Generate Kettle objects?

        Or take it one step further: there is only one parameterised
        hub load object, so there is no need to know the XML structure
        of PDI objects.
Goal


    Generic ETL to load a Data Vault

    Metadata driven

    No generation, 1 object for each Data Vault entity
    
        Hub
    
        Link
    
        Hub satellite
    
        Link satellite
    
        Define the mappings, create the Data Vault tables: done!
Tools


    Ubuntu

    Pentaho Data Integration CE

    LibreOffice Calc

    MySQL 5.1

    Cookbook, doc generation by Roland Bouman

    (PostgreSQL 9.0, Oracle 11)



Deliverables


    Set of PDI jobs and transformations

    Configuration files:
kettle.properties
shared.xml
repositories.xml

    Excel sheet that contains the specifications

    Scripts to generate/populate the pdi_meta and
    data_vault databases (or schemas)

Design decisions


    Updateable views with generic column names

    (MySQL more lenient than PostgreSQL)

    Compare satellite attributes via string comparison
    (concatenate all columns, with | (pipe) as delimiter)

    'inject' the metadata using Kettle parameters

    Generate and use an error table for each Data Vault
    table. Kettle handles the errors. This helps to find DV
    design conflicts; the error tables should contain few or
    no records in production.
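The pipe-delimited comparison can be sketched in Python (a hypothetical helper, not the actual Kettle step configuration; in the framework the concatenation happens inside the transformation):

```python
def satellite_fingerprint(row, columns, delimiter="|"):
    """Concatenate attribute values into one string for change detection.

    NULL (None) is rendered as an empty string so a missing value still
    yields a stable fingerprint.
    """
    return delimiter.join(
        "" if row.get(col) is None else str(row[col]) for col in columns
    )

cols = ["name", "city", "phone"]
stored = {"name": "Weber", "city": "Utrecht", "phone": None}
incoming = {"name": "Weber", "city": "Amsterdam", "phone": None}

# Only rows whose fingerprint differs get a new satellite version.
changed = satellite_fingerprint(stored, cols) != satellite_fingerprint(incoming, cols)
# changed → True ("Weber|Utrecht|" vs "Weber|Amsterdam|")
```

Note the design trade-off: if an attribute value itself contains the delimiter, two different rows can produce the same string, so the delimiter must be chosen to be (practically) absent from the data.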
Prerequisites


    Data Vault designed and implemented in database

    Staging tables and loading procedures in place
(can also be generic; we use the PDI Metadata Injection step for loading files)

    Mapping from source to Data Vault specified
(now in an Excel sheet)




Metadata tables




ref_data_vault_link_sources!




Design in LibreOffice (sources)




Design in LibreOffice (hub + sat)




Loading the metadata




'design errors'

Checks to avoid debugging
(compare the design metadata with the Data Vault database's information_schema):



    hubs, links, satellites that don't exist in the DV

    key columns that do not exist in the DV

    missing connection data (source db)

    missing attribute columns
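These checks amount to set differences between the design metadata and the database catalog. A minimal sketch, with invented table names and the information_schema query result mocked as a plain set:

```python
# Tables according to the design sheet (invented names).
designed = {"hub_patient", "link_admission", "sat_patient"}

# Tables actually present in the Data Vault schema, e.g. the result of
#   SELECT table_name FROM information_schema.tables
#   WHERE table_schema = 'data_vault'
in_database = {"hub_patient", "link_admission"}

# Designed objects that do not exist in the DV: report as design errors.
missing = sorted(designed - in_database)
print(missing)  # → ['sat_patient']
```

The same set difference, run per object type (hubs, links, satellites, key columns, attribute columns), yields each of the error lists above.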


A complete run




Spec: loading a hub

Load a hub, specified by:

    name

    key column

    business key column

    source table

    source table business key column
    (can be an expression, e.g. a concatenation for a composite key)
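Given such a spec, the generic hub load boils down to an anti-join insert of unknown business keys. A sketch of the statement the parameterised object effectively runs (all names are invented; the real framework passes the values in as Kettle parameters rather than building SQL strings):

```python
hub_spec = {
    "hub_table": "hub_patient",      # name
    "hub_key": "patient_key",        # key column (generated by the database)
    "business_key": "patient_nr",    # business key column in the hub
    "source_table": "stg_patient",   # source table
    "source_bk_expr": "patient_nr",  # may be an expression, e.g. CONCAT(a, '|', b)
}

def hub_load_sql(spec):
    """Insert business keys from staging that the hub does not know yet."""
    return (
        f"INSERT INTO {spec['hub_table']} ({spec['business_key']})\n"
        f"SELECT DISTINCT s.bk\n"
        f"FROM (SELECT {spec['source_bk_expr']} AS bk"
        f" FROM {spec['source_table']}) s\n"
        f"LEFT JOIN {spec['hub_table']} h"
        f" ON h.{spec['business_key']} = s.bk\n"
        f"WHERE h.{spec['business_key']} IS NULL"
    )
```

Swapping in a different spec dict is all it takes to load another hub, which is the metadata-driven idea in a nutshell.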


The Kettle objects: job hub




The Kettle objects: trf hub




Spec: loading a link

Load a link, specified by:

    name

    key column

    for each hub (maximum 10, can be a ref-table)
    
        hub name
    
        column name for the hub key in the link (roles!)
    
        column in the source table → business key of hub

    link 'attributes' (part of key, no hub, maximum 5)

    source table
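A link spec is one row of metadata with up to ten hub references. A sketch of how the per-hub entries (with role columns) drive the hub-key lookups, using invented names:

```python
link_spec = {
    "link_table": "link_admission",
    "link_key": "admission_key",
    "source_table": "stg_admission",
    "hubs": [
        # (hub name, key column in the link (the role), source business-key column)
        ("hub_patient", "patient_key", "patient_nr"),
        ("hub_hospital", "hospital_key", "hospital_code"),
    ],
}

def link_lookup_plan(spec):
    """One hub-key lookup per referenced hub: business key -> hub key."""
    return [
        f"lookup {hub}: {spec['source_table']}.{src} -> {role}"
        for hub, role, src in spec["hubs"]
    ]

for step in link_lookup_plan(link_spec):
    print(step)
```

Because the role column name is part of the spec, the same hub can appear twice in one link under different roles (e.g. sending and receiving hospital).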
The Kettle objects: job link




The Kettle objects: trf link

            Remove Unused 1 hub
            (peg-legged link)




Spec: loading a hub satellite

Load a hub satellite, specified by:

    name

    key column

    hub name

    column in the source table → business key of hub

    for each attribute (maximum 200)
    
        source column
    
        target column

    source table
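The attribute mappings are plain (source column, target column) pairs, which is enough to build the select list the satellite load reads from staging. A sketch with invented names:

```python
sat_spec = {
    "sat_table": "sat_patient",
    "hub": "hub_patient",
    "source_table": "stg_patient",
    # up to 200 (source column, target column) pairs
    "attributes": [("pat_name", "name"), ("pat_city", "city")],
}

def satellite_select(spec):
    """Select list that renames staging columns to their satellite names."""
    cols = ", ".join(f"{src} AS {tgt}" for src, tgt in spec["attributes"])
    return f"SELECT {cols} FROM {spec['source_table']}"

print(satellite_select(sat_spec))
# → SELECT pat_name AS name, pat_city AS city FROM stg_patient
```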
The Kettle objects: job hub sat




The Kettle objects: trf hub sat




Spec: loading a link satellite

Load a link satellite, specified by:

    name

    key column

    link name

    for each hub of the link:
column in the source table → business key of hub

    for each key attribute: source column

    for each attribute: source column → target column

    source table
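Resolving the link key works through the hubs of the link: each source business-key column is looked up in its hub, and the combination of hub keys identifies the link row. A condensed sketch (invented names, plain dicts standing in for the lookup steps):

```python
link_sat_spec = {
    "sat_table": "sat_admission",
    "link": "link_admission",
    # per hub of the link: the source column holding its business key
    "hub_bk_columns": {"hub_patient": "patient_nr", "hub_hospital": "hospital_code"},
}

# Hypothetical hub content: business key -> hub key.
hub_lookup = {
    "hub_patient": {"P001": 1},
    "hub_hospital": {"H42": 7},
}

def resolve_member_keys(spec, source_row):
    """Resolve the hub keys that together identify the link row."""
    return {
        hub: hub_lookup[hub][source_row[col]]
        for hub, col in spec["hub_bk_columns"].items()
    }

row = {"patient_nr": "P001", "hospital_code": "H42"}
keys = resolve_member_keys(link_sat_spec, row)
# keys → {'hub_patient': 1, 'hub_hospital': 7}
```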
Executing in a loop ..




.. and parallel




Logging

Default PDI logging enabled (e.g. errors)
Seeing 'generic job' N times is not very informative, so the
  jobs also log:
   
       hub name
   
       link name
   
       hub satellite name
   
       link satellite name
   
        number of rows at start/end
   
       start/end time
Some points of interest


    Easy to make a mistake in the design sheet

    Generic → a bit harder to maintain and debug

    Application/tool to maintain metadata?

    Documentation (internals, checklists)




Availability of the code


    Free, because that's fair. I make a living with stuff that
    other people give away for free.

    Two flavours for now, MySQL and PostgreSQL.
    Oracle is 'under construction'.

    It's not on SourceForge; just mail me some Belgian
    beer and you get the code.





Benedutch 2011 ew_ppt
