The Great Big Data Vault Lie

I have recently been running half-day “Introduction to Hook” sessions for a client. When explaining that the Hook approach emerged from difficulties I had experienced with Data Vault, I was asked how to improve the performance of Data Vault queries, given “all those tables”. I restrained myself from answering “well, don’t use Data Vault”, but the question got me thinking about the performance claims evangelised by Data Vault enthusiasts, and I realised what “the big lie” is.

We are told that loading data into a DV2 structure is “frictionless” and that we can load hub, link, and satellite tables in parallel. The explanation goes something like this: when loading a source table into the raw vault, we target either one hub table and a satellite table, or multiple hub tables, a link, and a satellite table. Can we load these tables in parallel? Yes: because there are no enforced referential integrity constraints between them, data can be loaded into hub, link, and satellite tables simultaneously without any contention. Any relationships are said to be “eventually consistent”, so the order in which the tables are loaded is immaterial; we can load them all at once if we wish.

It appears that we can load data in parallel with zero contention, but that isn’t entirely true. These sneaky evangelists have performed a sleight-of-hand, tricking you into thinking about the problem in a certain way, so allow me to offer an alternative perspective.

Rather than thinking about loading a single source table, think about loading 100 source tables. Can we load all of them in parallel? Let’s keep this simple and assume that all 100 tables contain a business key for a single hub (let’s say Customer). Therefore, we can load each of these 100 source tables to a separate hub satellite attached to the Customer hub. Consider the process of loading a single table (sketched in SQL after the list):

  1. Get a distinct list of business keys from the source table.
  2. Check which of those business keys do not already exist in the hub table.
  3. Insert new business keys into the hub.
  4. Insert source data into the satellite table.

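As a minimal SQL sketch of that pattern (the table and column names here are hypothetical, and the hash function and timestamp syntax vary by DBMS):

    -- Steps 1-3: find distinct business keys in the source that are not
    -- yet in the hub, and insert them.
    INSERT INTO hub_customer (customer_hk, customer_bk, load_dts, record_source)
    SELECT DISTINCT
           MD5(s.customer_bk),        -- hash key derived from the business key
           s.customer_bk,
           CURRENT_TIMESTAMP,
           'crm_orders'
    FROM   crm_orders AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   hub_customer AS h
                       WHERE  h.customer_bk = s.customer_bk);  -- step 2

    -- Step 4: land the descriptive attributes in the satellite.
    INSERT INTO sat_crm_orders (customer_hk, load_dts, record_source, order_status)
    SELECT MD5(s.customer_bk),
           CURRENT_TIMESTAMP,
           'crm_orders',
           s.order_status
    FROM   crm_orders AS s;
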
The question is: can we load new business keys from different source tables into the Customer hub at the same time? No, we can’t. Step 2 requires exclusive access to the current list of business keys in the hub; if two loads run that check concurrently, both can conclude that the same key is new, and both will insert it, creating duplicates. The check-and-insert must therefore be serialised, which means loading the source tables one at a time.
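To make the race concrete, here is what happens when two of those loads run at once, and one way to serialise them (the LOCK TABLE syntax is PostgreSQL-flavoured and purely illustrative; the mechanism differs by DBMS):

    -- Without serialisation, two concurrent loaders can both pass step 2:
    --   loader A: NOT EXISTS finds 'C-42' missing       -> plans to insert
    --   loader B: NOT EXISTS also finds 'C-42' missing  -> plans to insert
    --   both insert 'C-42'                              -> duplicate hub key
    --
    -- One way to serialise: take an exclusive lock around check-and-insert.
    BEGIN;
    LOCK TABLE hub_customer IN EXCLUSIVE MODE;
    INSERT INTO hub_customer (customer_hk, customer_bk, load_dts, record_source)
    SELECT DISTINCT MD5(s.customer_bk), s.customer_bk, CURRENT_TIMESTAMP, 'crm_orders'
    FROM   crm_orders AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   hub_customer AS h
                       WHERE  h.customer_bk = s.customer_bk);
    COMMIT;
    -- Every other loader targeting hub_customer now queues behind this
    -- lock, which is exactly the sequential behaviour described above.
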

Now imagine the complexity when a load must hit multiple hubs and link tables. To achieve any degree of parallelism, we must devise creative solutions that prevent contention while avoiding deadlocks and orchestration headaches.

Of course, Hook does not suffer these constraints. Each target frame can be loaded entirely independently of any other frame: no contention, no risk of deadlocks, no uniqueness checks; just a straight INSERT INTO. Loading data into a Hook warehouse is as frictionless as it gets. You are more likely to need to throttle throughput to avoid overwhelming your DBMS with concurrent insert queries, but that is an engineering decision, not a modelling concern.
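By way of contrast, a sketch of a Hook-style frame load (the frame name, hook column, and key-qualifier format are hypothetical; the point is the absence of any existence check):

    -- A frame load is a plain append: no lookup against a shared table,
    -- so 100 of these can run concurrently without touching each other.
    INSERT INTO frame_crm_orders (hk_customer, order_id, order_status, load_dts)
    SELECT 'customer.crm|' || s.customer_bk,  -- qualified business key ("hook")
           s.order_id,
           s.order_status,
           CURRENT_TIMESTAMP
    FROM   crm_orders AS s;
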

If you would like to learn more about Hook, please don’t hesitate to reach out. I am available to run half-day introduction sessions at your convenience.

Fábio de Salles

Business Geek/Data Leader

So the "great lie" you refer to is just wrong presumptions about how the "DV promises" are phrased? Wouldn't it be more elegant and productive to point out, to Dan himself, where the Data Vault argument needs more detail? (Not that I believe it does, but given your position reporting on your difficulties with DV, that is what I would do in your place.) After all, there are lots of things you can load in parallel with even a modest RDBMS Data Vault. For instance, you can load all sources for a single hub at once if you collate them before running the load, and you can run the same pattern for all hubs, from all sources; that is a lot of parallel loading. Besides that, you can load a hub and its satellite simultaneously from a single source because of hash keys. And you (usually) don't load satellites from different systems into the same satellite table (you have one for each source system in the raw DV), which would allow you to load all, if not most, source satellites for a single hub simultaneously. The same point goes validly for links and link satellites. I fail to see the big lie. It seems to me, however, that you are trying to promote your product by attacking the strongest method available today for EDW.

Paul Johnson

Founder & Technical Lead | Data Engineering | SQL Performance | Scalable Analytics

“We now need to devise creative solutions to prevent contention if we want to achieve any degree of parallelism and avoid deadlock or orchestration issues.” ETL schedule design 101?

Feathers are ruffled! 😅

Pugazendhi A.

Sudar.io - Extreme Automation of Enterprise Data Integration (DV 2.0), AI data foundation and data products. A first in 1+ ways!

If such a one-in-a-million situation needs to be projected so big to make a case for Hook, that isn’t selling Hook well enough.
