Introduction

                    Edwin Weber
                   Weber Solutions
                eacweber@gmail.com


Back end of Data Warehousing
MySQL, SQL Server, Oracle, PostgreSQL
PDI, SSIS, Oracle Warehouse Builder (long ago)
Project

Sint Antonius hospital in Utrecht, Open Source oriented
Chance to combine Kettle experience with a Data Vault
  (new to me)
At practically the same time: a project with SSIS and a Data Vault
So I jumped on the Data Vault bandwagon
Data Vault ETL


    Many objects to load, standardized procedures

    This screams for a generic solution

    I don't want to:

        manage too many Kettle objects

        connect similar columns in mappings by hand

    Solution:

        Generate Kettle objects?

        Or take it one step further: there is only one parameterised
        hub load object, so there is no need to know the XML structure
        of PDI objects.
Goal


    Generic ETL to load a Data Vault

    Metadata driven

    No generation, 1 object for each Data Vault entity
    
        Hub
    
        Link
    
        Hub satellite
    
        Link satellite
    
        Define the mappings, create the Data Vault tables: done!
Tools


    Ubuntu

    Pentaho Data Integration CE

    LibreOffice Calc

    MySQL 5.1

    Cookbook, doc generation by Roland Bouman

    (PostgreSQL 9.0, Oracle 11)



Deliverables


    Set of PDI jobs and transformations

    Configuration files:
kettle.properties
shared.xml
repositories.xml

    Excel sheet that contains the specifications

    Scripts to generate/populate the pdi_meta and
    data_vault databases (or schemas)

Design decisions


    Updateable views with generic column names

    (MySQL more lenient than PostgreSQL)

    Compare satellite attributes via string comparison
    (concatenate all columns, with | (pipe) as delimiter)

    'inject' the metadata using Kettle parameters

    Generate and use an error table for each Data Vault
    table. Kettle handles the errors. This helps to find DV
    design conflicts; the error tables should contain few or
    no records in production.
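The pipe-delimited comparison can be sketched in Python (a hypothetical helper, not the actual Kettle step configuration; in the framework the concatenation happens inside the transformation):

```python
def satellite_fingerprint(row, columns, delimiter="|"):
    """Concatenate attribute values into one string for change detection.

    NULL (None) is rendered as an empty string so a missing value still
    yields a stable fingerprint.
    """
    return delimiter.join(
        "" if row.get(col) is None else str(row[col]) for col in columns
    )

cols = ["name", "city", "phone"]
stored = {"name": "Weber", "city": "Utrecht", "phone": None}
incoming = {"name": "Weber", "city": "Amsterdam", "phone": None}

# Only rows whose fingerprint differs get a new satellite version.
changed = satellite_fingerprint(stored, cols) != satellite_fingerprint(incoming, cols)
# changed → True ("Weber|Utrecht|" vs "Weber|Amsterdam|")
```

Note the design trade-off: if an attribute value itself contains the delimiter, two different rows can produce the same string, so the delimiter must be chosen to be (practically) absent from the data.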
Prerequisites


    Data Vault designed and implemented in database

    Staging tables and loading procedures in place
(can also be generic; we use the PDI Metadata Injection step for loading files)

    Mapping from source to Data Vault specified
(now in an Excel sheet)




Metadata tables




ref_data_vault_link_sources!




Design in LibreOffice (sources)




Design in LibreOffice (hub + sat)




Loading the metadata




'design errors'

Checks to avoid debugging
(compare the design metadata with the Data Vault database's information_schema):



    hubs, links, satellites that don't exist in the DV

    key columns that do not exist in the DV

    missing connection data (source db)

    missing attribute columns
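These checks amount to set differences between the design metadata and the database catalog. A minimal sketch, with invented table names and the information_schema query result mocked as a plain set:

```python
# Tables according to the design sheet (invented names).
designed = {"hub_patient", "link_admission", "sat_patient"}

# Tables actually present in the Data Vault schema, e.g. the result of
#   SELECT table_name FROM information_schema.tables
#   WHERE table_schema = 'data_vault'
in_database = {"hub_patient", "link_admission"}

# Designed objects that do not exist in the DV: report as design errors.
missing = sorted(designed - in_database)
print(missing)  # → ['sat_patient']
```

The same set difference, run per object type (hubs, links, satellites, key columns, attribute columns), yields each of the error lists above.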


A complete run




Spec: loading a hub

Load a hub, specified by:

    name

    key column

    business key column

    source table

    source table business key column
    (can be an expression, e.g. a concatenation for a composite key)
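Given such a spec, the generic hub load boils down to an anti-join insert of unknown business keys. A sketch of the statement the parameterised object effectively runs (all names are invented; the real framework passes the values in as Kettle parameters rather than building SQL strings):

```python
hub_spec = {
    "hub_table": "hub_patient",      # name
    "hub_key": "patient_key",        # key column (generated by the database)
    "business_key": "patient_nr",    # business key column in the hub
    "source_table": "stg_patient",   # source table
    "source_bk_expr": "patient_nr",  # may be an expression, e.g. CONCAT(a, '|', b)
}

def hub_load_sql(spec):
    """Insert business keys from staging that the hub does not know yet."""
    return (
        f"INSERT INTO {spec['hub_table']} ({spec['business_key']})\n"
        f"SELECT DISTINCT s.bk\n"
        f"FROM (SELECT {spec['source_bk_expr']} AS bk"
        f" FROM {spec['source_table']}) s\n"
        f"LEFT JOIN {spec['hub_table']} h"
        f" ON h.{spec['business_key']} = s.bk\n"
        f"WHERE h.{spec['business_key']} IS NULL"
    )
```

Swapping in a different spec dict is all it takes to load another hub, which is the metadata-driven idea in a nutshell.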


The Kettle objects: job hub




The Kettle objects: trf hub




Spec: loading a link

Load a link, specified by:

    name

    key column

    for each hub (maximum 10, can be a ref-table)
    
        hub name
    
        column name for the hub key in the link (roles!)
    
        column in the source table → business key of hub

    link 'attributes' (part of key, no hub, maximum 5)

    source table
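A link spec is one row of metadata with up to ten hub references. A sketch of how the per-hub entries (with role columns) drive the hub-key lookups, using invented names:

```python
link_spec = {
    "link_table": "link_admission",
    "link_key": "admission_key",
    "source_table": "stg_admission",
    "hubs": [
        # (hub name, key column in the link (the role), source business-key column)
        ("hub_patient", "patient_key", "patient_nr"),
        ("hub_hospital", "hospital_key", "hospital_code"),
    ],
}

def link_lookup_plan(spec):
    """One hub-key lookup per referenced hub: business key -> hub key."""
    return [
        f"lookup {hub}: {spec['source_table']}.{src} -> {role}"
        for hub, role, src in spec["hubs"]
    ]

for step in link_lookup_plan(link_spec):
    print(step)
```

Because the role column name is part of the spec, the same hub can appear twice in one link under different roles (e.g. sending and receiving hospital).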
The Kettle objects: job link




The Kettle objects: trf link

            Remove Unused 1 hub
            (peg-legged link)




Spec: loading a hub satellite

Load a hub satellite, specified by:

    name

    key column

    hub name

    column in the source table → business key of hub

    for each attribute (maximum 200)
    
        source column
    
        target column

    source table
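The attribute mappings are plain (source column, target column) pairs, which is enough to build the select list the satellite load reads from staging. A sketch with invented names:

```python
sat_spec = {
    "sat_table": "sat_patient",
    "hub": "hub_patient",
    "source_table": "stg_patient",
    # up to 200 (source column, target column) pairs
    "attributes": [("pat_name", "name"), ("pat_city", "city")],
}

def satellite_select(spec):
    """Select list that renames staging columns to their satellite names."""
    cols = ", ".join(f"{src} AS {tgt}" for src, tgt in spec["attributes"])
    return f"SELECT {cols} FROM {spec['source_table']}"

print(satellite_select(sat_spec))
# → SELECT pat_name AS name, pat_city AS city FROM stg_patient
```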
The Kettle objects: job hub sat




The Kettle objects: trf hub sat




Spec: loading a link satellite

Load a link satellite, specified by:

    name

    key column

    link name

    for each hub of the link:
column in the source table → business key of hub

    for each key attribute: source column

    for each attribute: source column → target column

    source table
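Resolving the link key works through the hubs of the link: each source business-key column is looked up in its hub, and the combination of hub keys identifies the link row. A condensed sketch (invented names, plain dicts standing in for the lookup steps):

```python
link_sat_spec = {
    "sat_table": "sat_admission",
    "link": "link_admission",
    # per hub of the link: the source column holding its business key
    "hub_bk_columns": {"hub_patient": "patient_nr", "hub_hospital": "hospital_code"},
}

# Hypothetical hub content: business key -> hub key.
hub_lookup = {
    "hub_patient": {"P001": 1},
    "hub_hospital": {"H42": 7},
}

def resolve_member_keys(spec, source_row):
    """Resolve the hub keys that together identify the link row."""
    return {
        hub: hub_lookup[hub][source_row[col]]
        for hub, col in spec["hub_bk_columns"].items()
    }

row = {"patient_nr": "P001", "hospital_code": "H42"}
keys = resolve_member_keys(link_sat_spec, row)
# keys → {'hub_patient': 1, 'hub_hospital': 7}
```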
Executing in a loop ..




.. and parallel




Logging

Default PDI logging enabled (e.g. errors)
Seeing 'generic job' N times is not very informative, so the
  jobs also log:
   
       hub name
   
       link name
   
       hub satellite name
   
       link satellite name
   
        number of rows at start/end
   
       start/end time
Some points of interest


    Easy to make a mistake in the design sheet

    Generic → a bit harder to maintain and debug

    Application/tool to maintain metadata?

    Documentation (internals, checklists)




Availability of the code


    Free, because that's fair. I make a living with stuff that
    other people give away for free.

    Two flavours for now, MySQL and PostgreSQL.
    Oracle is 'under construction'.

    It's not on SourceForge; just mail me some Belgian
    beer and you get the code.





Benedutch 2011 ew_ppt
